Abstract: We consider the problem of estimating an unknown $n_1 \times n_2$
matrix $\boldsymbol{\theta}^*$ from noisy observations under the constraint that
$\boldsymbol{\theta}^*$ is nondecreasing in both rows and columns. We consider the least
squares estimator (LSE) in this setting and study its risk properties. We show
that the worst case risk of the LSE is $n^{-1/2}$, up to multiplicative
logarithmic factors, where $n = n_1 n_2$, and that the LSE is minimax rate
optimal (up to logarithmic factors). We further prove that for some special
$\boldsymbol{\theta}^*$, the risk of the LSE could be much smaller than $n^{-1/2}$; in fact,
it could even be parametric, i.e., $n^{-1}$ up to logarithmic factors. Such
parametric rates occur when the number of “rectangular” blocks of $\boldsymbol{\theta}^*$
is bounded from above by a constant. We also derive an interesting adaptation
property of the LSE which we term variable adaptation: the
LSE adapts to the “intrinsic dimension” of the problem and performs as
well as the oracle estimator when estimating a matrix that is constant
along each row/column. Our proofs, which borrow ideas from empirical
process theory, approximation theory and convex geometry, are of
independent interest.

metric entropy bounds, minimax lower bound, tangent cone, variable
adaptation.


1. Introduction 


This paper studies the problem of estimating an unknown $n_1 \times n_2$ matrix
$\boldsymbol{\theta}^*$ under the constraint that $\boldsymbol{\theta}^*$ is nondecreasing in both rows and columns.
In order to put this problem and our results in proper context, consider
first the problem of estimating an unknown nondecreasing sequence under
Gaussian measurements. Specifically, consider the problem of estimating
$\theta^* = (\theta_1^*, \dots, \theta_n^*) \in \mathbb{R}^n$ from observations

$$y_i = \theta_i^* + \epsilon_i \qquad \text{for } i = 1, \dots, n$$

under the constraint that the unknown sequence $\theta^*$ satisfies $\theta_1^* \le \dots \le \theta_n^*$.
Here the unobserved errors $\epsilon_1, \dots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$ with $\sigma > 0$ unknown.
We refer to this as the vector isotonic estimation problem. This is a special
case of univariate isotonic regression and has a long history; see e.g., Brunk
[5], Ayer et al. [1], and van Eeden [24]. The most commonly used estimator
here is the least squares estimator (LSE) defined as

$$\hat{\theta} := \operatorname*{argmin}_{\theta \in \mathcal{C}_n} \sum_{i=1}^{n} (y_i - \theta_i)^2 \qquad \text{where } \mathcal{C}_n := \{\theta \in \mathbb{R}^n : \theta_1 \le \dots \le \theta_n\}. \tag{1}$$


The behavior of $\hat{\theta}$ as an estimator of $\theta^*$ is most naturally studied in terms of
the risk:

$$R_{\mathrm{vec}}(\theta^*, \hat{\theta}) := \frac{1}{n} E_{\theta^*} \|\hat{\theta} - \theta^*\|^2$$

where $\|\cdot\|$ denotes the usual Euclidean norm. The subscript vec is used to
indicate that this denotes the risk in the vector estimation problem. This risk
$R_{\mathrm{vec}}(\theta^*, \hat{\theta})$ has been studied by a number of authors including van de Geer
[20, 21], Donoho [11], Birgé and Massart [4], Wang [25], Meyer and Woodroofe
[16], Zhang [26] and Chatterjee et al. [7]. Zhang [26], among other things,
showed the existence of a universal positive constant $C$ such that


$$R_{\mathrm{vec}}(\theta^*, \hat{\theta}) \le C \left\{ \left( \frac{\sigma^2 \sqrt{D(\theta^*)}}{n} \right)^{2/3} + \frac{\sigma^2 \log n}{n} \right\} \tag{2}$$

with $D(\theta^*) := (\theta_n^* - \theta_1^*)^2$. This result shows that the risk of $\hat{\theta}$ is no more than
$n^{-2/3}$ (ignoring constant factors) provided $D(\theta^*)$ is bounded from above by
a constant. It can be proved that $n^{-2/3}$ is the minimax rate of estimation in
this problem (see e.g., Zhang [26]). Throughout the paper, $C$ will denote a
universal positive constant even though its exact value might change from
place to place.

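A complementary upper bound on $R_{\mathrm{vec}}(\theta^*, \hat{\theta})$ has been proved recently by
Bellec [3] who showed that

$$R_{\mathrm{vec}}(\theta^*, \hat{\theta}) \le \inf_{\theta \in \mathcal{C}_n} \left( \frac{\|\theta^* - \theta\|^2}{n} + \frac{\sigma^2 k(\theta)}{n} \log \frac{en}{k(\theta)} \right) \tag{3}$$

where $k(\theta)$ is the cardinality of the set $\{\theta_1, \dots, \theta_n\}$. This result is an
improvement of a previous result by Chatterjee et al. [7] where inequality (3)
was proved with an additional constant multiplicative factor.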

The two bounds (2) and (3) provide a holistic understanding of the global
accuracy of the LSE $\hat{\theta}$ in vector isotonic estimation: its risk can never be larger
than the minimax rate $(\sigma^2 \sqrt{D(\theta^*)}/n)^{2/3}$ while it can be the parametric rate
$\sigma^2/n$, up to logarithmic multiplicative factors, if $\theta^*$ can be well approximated
by $\theta \in \mathcal{C}_n$ with small $k(\theta)$. We refer to (2) as the worst case risk bound of the
LSE and to (3) as the adaptive risk bound (adaptive because it states that
the risk of the LSE is smaller than the worst case rate for certain special $\theta^*$).

The goal of this paper is to extend both these worst case and adaptive risk
bounds to the case of matrix isotonic estimation. Matrix isotonic estimation
refers to the problem of estimating an unknown matrix $\boldsymbol{\theta}^* = (\theta^*_{ij}) \in \mathbb{R}^{n_1 \times n_2}$
from observations

$$y_{ij} = \theta^*_{ij} + \epsilon_{ij}, \qquad \text{for } i = 1, \dots, n_1, \; j = 1, \dots, n_2, \tag{4}$$

where $\boldsymbol{\theta}^*$ is constrained to lie in

$$\mathcal{M} := \{\boldsymbol{\theta} \in \mathbb{R}^{n_1 \times n_2} : \theta_{ij} \le \theta_{kl} \text{ whenever } i \le k \text{ and } j \le l\},$$

and the random errors $\epsilon_{ij}$ are i.i.d. $N(0, \sigma^2)$, with $\sigma^2 > 0$ unknown. We
refer to any matrix in $\mathcal{M}$ as an isotonic matrix. Throughout we let $n := n_1 n_2$
denote the product of $n_1$ and $n_2$. As a notational convention, throughout the
paper, we denote matrices in boldface and the $(i,j)$'th entry of a matrix $\boldsymbol{A}$
will simply be denoted by $A_{ij}$.

Monotonicity restrictions on matrices are increasingly being used as a key
component of latent variable based models for the estimation of matrices
and graphs. Two such examples are: (1) the estimation of graphons under
monotonicity constraints (see Chan and Airoldi [6]), and (2) the nonparametric
Bradley-Terry model (see Chatterjee [10]). In both of these examples,
the unknown matrix satisfies monotonicity constraints similar to the ones
studied here. But the observation model is more complicated because of the
presence of latent permutations. Nevertheless, we believe that studying the
matrix isotonic estimation problem described above is the first step towards
understanding the estimation properties in these more complicated models.

The matrix isotonic estimation problem also has a direct connection to bivariate
isotonic regression. Bivariate isotonic regression is the problem of estimating
a regression function $f : [0,1]^2 \to \mathbb{R}$ which is known to be coordinate-wise
nondecreasing (i.e., if $s_1 \le t_1$ and $s_2 \le t_2$, where $(s_1, s_2), (t_1, t_2) \in [0,1]^2$,
then $f(s_1, s_2) \le f(t_1, t_2)$), from observations

$$y_{ij} = f(i/n_1, j/n_2) + \epsilon_{ij}, \qquad \text{for } i = 1, \dots, n_1, \; j = 1, \dots, n_2. \tag{5}$$

Identifying $f(i/n_1, j/n_2) = \theta^*_{ij}$ we see that (4) and (5) are equivalent problems.
Problem (5) is possibly the simplest example of a multivariate shape constrained
regression problem and arises quite often in production planning and inventory
control; see e.g., the classical textbooks Barlow et al. [2] and Robertson
et al. [19] on this subject.

Let us now introduce the LSE in matrix isotonic estimation. Let $\boldsymbol{y} = (y_{ij})$
denote the matrix (of order $n_1 \times n_2$) of the observed response. The LSE, $\hat{\boldsymbol{\theta}}$,
is defined as the minimizer of the squared Frobenius norm, $\|\boldsymbol{y} - \boldsymbol{\theta}\|^2$, over
$\boldsymbol{\theta} \in \mathcal{M}$, i.e.,

$$\hat{\boldsymbol{\theta}} := \operatorname*{argmin}_{\boldsymbol{\theta} \in \mathcal{M}} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (y_{ij} - \theta_{ij})^2. \tag{6}$$

Because $\mathcal{M}$ is a closed convex cone in $\mathbb{R}^{n_1 \times n_2}$ (which is the space of all $n_1 \times n_2$
matrices), the LSE $\hat{\boldsymbol{\theta}}$ exists uniquely. Further, it can be computed efficiently
by an iterative algorithm (see e.g., Gebhardt [13] and Robertson et al. [19,
Chapter 1]); this is in spite of the fact that it is defined as the solution of a
quadratic program with $O(n^2)$ linear constraints.
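To make the remark about iterative computation concrete, here is a minimal sketch that approximates the LSE (6) by Dykstra's alternating projection scheme, cycling between row-wise and column-wise vector isotonic regressions, each carried out by the pool-adjacent-violators algorithm. This is one standard technique, adopted here as an assumption; it is not claimed to be the specific algorithm of the cited references. The names `pava`, `_combine` and `matrix_isotonic_lse`, and the fixed iteration count, are hypothetical choices made for illustration.

```python
def pava(y):
    """Least squares fit of a nondecreasing sequence to y (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


def _combine(a, b, sign):
    """Entrywise a + sign * b for two matrices stored as lists of rows."""
    return [[x + sign * z for x, z in zip(ra, rb)] for ra, rb in zip(a, b)]


def matrix_isotonic_lse(y, n_iter=200):
    """Approximate projection of y onto matrices nondecreasing in rows and columns."""
    x = [[float(v) for v in row] for row in y]
    p = [[0.0] * len(row) for row in y]  # Dykstra correction for the row step
    q = [[0.0] * len(row) for row in y]  # Dykstra correction for the column step
    for _ in range(n_iter):
        shifted = _combine(x, p, +1)
        a = [pava(row) for row in shifted]                 # monotone along each row
        p = _combine(shifted, a, -1)
        shifted = _combine(a, q, +1)
        cols = [pava(list(col)) for col in zip(*shifted)]  # monotone along each column
        x = [list(row) for row in zip(*cols)]
        q = _combine(shifted, x, -1)
    return x
```

For the $2 \times 2$ input with rows $(1, 0)$ and $(0, 1)$, the iterates approach the matrix with rows $(1/3, 1/3)$ and $(1/3, 1)$, which can be checked by hand to satisfy the optimality conditions for projection onto the cone $\mathcal{M}$.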

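It is fair to say that not much is known about the behavior of $\hat{\boldsymbol{\theta}}$ as an
estimator of $\boldsymbol{\theta}^*$. The only result known in this direction is the consistency of
$\hat{\boldsymbol{\theta}}$; see e.g., Hanson et al. [14], Makowski [15] and Robertson and Wright [18].
In this paper we study the risk of $\hat{\boldsymbol{\theta}}$ as an estimator of $\boldsymbol{\theta}^*$, defined as

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) := E_{\boldsymbol{\theta}^*} \, \ell^2(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \qquad \text{where} \qquad \ell^2(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) := \frac{1}{n} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (\hat{\theta}_{ij} - \theta^*_{ij})^2.$$

Here $E_{\boldsymbol{\theta}^*}$ denotes the expectation taken with respect to $\boldsymbol{y}$ having the
distribution given by (4). Also, throughout the paper, we take $n = n_1 n_2$ and
each of $n_1$ and $n_2$ to be strictly larger than one. We similarly define the risk
$R(\boldsymbol{\theta}^*, \tilde{\boldsymbol{\theta}})$ for any other estimator $\tilde{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}^*$.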

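To the best of our knowledge, nothing is known in the literature about the
risk $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$. The goal of this paper is to prove analogues of the
inequalities (2) and (3) for $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$. The first result of this paper, Theorem 2.1, is
the analogue of (2) for matrix isotonic estimation. Specifically, we prove in
Theorem 2.1 that

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le C \left( \sqrt{\frac{\sigma^2 D(\boldsymbol{\theta}^*)}{n}} (\log n)^4 + \frac{\sigma^2}{n} (\log n)^8 \right) \tag{7}$$

for a universal positive constant $C$ where $D(\boldsymbol{\theta}^*) := (\theta^*_{n_1 n_2} - \theta^*_{11})^2$.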

Our second result proves that the minimax risk in this problem is bounded
from below by $(\sigma^2 D(\boldsymbol{\theta}^*)/n)^{1/2}$, up to constant multiplicative factors.
Specifically, we prove in Theorem 2.2 that

$$\inf_{\tilde{\boldsymbol{\theta}}} \sup_{\boldsymbol{\theta} \in \mathcal{M} : D(\boldsymbol{\theta}) \le D} R(\boldsymbol{\theta}, \tilde{\boldsymbol{\theta}}) \ge \sqrt{\frac{\sigma^2 D}{192 n}} \tag{8}$$

under some conditions on $n_1$ and $n_2$ (see Theorem 2.2 for the precise
statement). The above infimum is taken over all estimators $\tilde{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$. Combined with
(7), this proves that $\hat{\boldsymbol{\theta}}$ is minimax, up to logarithmic multiplicative factors.
Therefore, inequality (7) is the correct analogue of (2) for matrix isotonic
estimation.


Next we describe our analogue of inequality (3) for matrix isotonic estimation.
The situation here is more subtle compared to the vector case. The
most natural analogue of (3) in the matrix case is an inequality of the form:

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le \inf_{\boldsymbol{\theta} \in \mathcal{M}} \left( \frac{\|\boldsymbol{\theta}^* - \boldsymbol{\theta}\|^2}{n} + \frac{\sigma^2 c(\boldsymbol{\theta}) \, p(\log n)}{n} \right) \tag{9}$$

where $p(\cdot)$ is some polynomial, $c(\boldsymbol{\theta})$ denotes the cardinality of the set
$\{\theta_{ij} : 1 \le i \le n_1, 1 \le j \le n_2\}$ and $\|\cdot\|$ refers to the Frobenius norm.
Unfortunately it turns out that this inequality cannot be true for every
$\boldsymbol{\theta}^* \in \mathcal{M}$ because it contradicts the minimax lower bound (8). The argument for
this is provided at the beginning of Section 2.2.

The fact that inequality (9) is false means that the LSE $\hat{\boldsymbol{\theta}}$ does not adapt to
every $\boldsymbol{\theta}^* \in \mathcal{M}$ with small $c(\boldsymbol{\theta}^*)$. It turns out that inequality (9) can be proved
for every $\boldsymbol{\theta}^* \in \mathcal{M}$ if the quantity $c(\boldsymbol{\theta})$ is replaced by a larger quantity. This
quantity will be denoted by $k(\boldsymbol{\theta})$ (because it is the right analogue of $k(\theta)$ for
the matrix case) and it is defined next after introducing some notation.

A subset $A$ of $\{1, \dots, n_1\} \times \{1, \dots, n_2\}$ is called a rectangle if
$A = \{(i,j) : k_1 \le i \le l_1, k_2 \le j \le l_2\}$ for some $1 \le k_1 \le l_1 \le n_1$ and
$1 \le k_2 \le l_2 \le n_2$. A rectangular partition of $\{1, \dots, n_1\} \times \{1, \dots, n_2\}$ is a
collection of rectangles $\pi = (A_1, \dots, A_k)$ which are disjoint and whose union is
$\{1, \dots, n_1\} \times \{1, \dots, n_2\}$. The cardinality of such a partition, $|\pi|$, is the
number of rectangles in the partition. The collection of all rectangular
partitions of $\{1, \dots, n_1\} \times \{1, \dots, n_2\}$ will be denoted by $\mathcal{P}$. For $\boldsymbol{\theta} \in \mathcal{M}$ and
$\pi = (A_1, \dots, A_k) \in \mathcal{P}$, we say that $\boldsymbol{\theta}$ is constant on $\pi$ if $\{\theta_{ij} : (i,j) \in A_l\}$ is a
singleton for each $l = 1, \dots, k$. We are now ready to define $k(\boldsymbol{\theta})$ for $\boldsymbol{\theta} \in \mathcal{M}$.
It is defined as the “number of rectangular blocks” of $\boldsymbol{\theta}$, i.e., the smallest
integer $k$ for which there exists a partition $\pi \in \mathcal{P}$ with $|\pi| = k$ such that $\boldsymbol{\theta}$
is constant on $\pi$. It is trivial to see that $k(\boldsymbol{\theta}) \ge c(\boldsymbol{\theta})$ for every $\boldsymbol{\theta} \in \mathcal{M}$. As a
simple illustration, for $\boldsymbol{\theta} = (\mathbb{1}\{i > 1, j > 1\})$, we have $c(\boldsymbol{\theta}) = 2$ and $k(\boldsymbol{\theta}) = 3$.
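The two quantities can be checked on small examples by brute force. The following sketch computes $c(\boldsymbol{\theta})$ directly and $k(\boldsymbol{\theta})$ by exhaustive search over rectangular partitions; it relies on the observation that the first uncovered cell in row-major order must be the top-left corner of its rectangle. The search is exponential in the worst case and is meant only for tiny matrices. The function names are hypothetical and are not taken from the paper.

```python
def distinct_values(theta):
    """c(theta): number of distinct entries."""
    return len({v for row in theta for v in row})


def rectangular_blocks(theta):
    """k(theta) by exhaustive search over rectangular partitions (tiny matrices only)."""
    n1, n2 = len(theta), len(theta[0])
    covered = [[False] * n2 for _ in range(n1)]
    best = [n1 * n2]  # the partition into single cells is always admissible

    def mark(i, j, height, width, flag):
        for r in range(i, i + height):
            for c in range(j, j + width):
                covered[r][c] = flag

    def search(used):
        if used >= best[0]:
            return  # cannot improve on the best partition found so far
        cell = next(((i, j) for i in range(n1) for j in range(n2)
                     if not covered[i][j]), None)
        if cell is None:
            best[0] = used
            return
        i, j = cell  # must be the top-left corner of its rectangle
        max_width = n2 - j
        for height in range(1, n1 - i + 1):
            row, width = i + height - 1, 0
            # Widest constant, uncovered run in the newly added row.
            while (width < max_width and not covered[row][j + width]
                   and theta[row][j + width] == theta[i][j]):
                width += 1
            max_width = width
            if max_width == 0:
                break
            for w in range(1, max_width + 1):
                mark(i, j, height, w, True)
                search(used + 1)
                mark(i, j, height, w, False)

    search(0)
    return best[0]
```

Applied to the $3 \times 3$ instance of the illustration above (first row and first column equal to 0, remaining entries equal to 1), the sketch returns 2 distinct values and 3 rectangular blocks.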

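Inequality (9) becomes true for all $\boldsymbol{\theta}^* \in \mathcal{M}$ if $c(\boldsymbol{\theta})$ is replaced by $k(\boldsymbol{\theta})$. This
is our adaptive risk bound for matrix isotonic estimation, proved in Theorem 2.3:

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le \inf_{\boldsymbol{\theta} \in \mathcal{M}} \left( \frac{\|\boldsymbol{\theta}^* - \boldsymbol{\theta}\|^2}{n} + \frac{C \sigma^2 k(\boldsymbol{\theta})}{n} (\log n)^8 \right) \tag{10}$$

where $C$ is a universal positive constant. As a consequence of this inequality,
we obtain that the risk of the LSE converges to zero at the parametric rate
$\sigma^2/n$, up to logarithmic multiplicative factors, provided $k(\boldsymbol{\theta}^*)$ is bounded
from above by a constant.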

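We also establish a property of the LSE that we term variable adaptation. Let
$\mathcal{C}_{n_1} := \{\theta \in \mathbb{R}^{n_1} : \theta_1 \le \dots \le \theta_{n_1}\}$. Suppose $\boldsymbol{\theta}^* = (\theta^*_{ij}) \in \mathcal{M}$ has the property
that $\theta^*_{ij}$ only depends on $i$, i.e., there exists $\theta^* \in \mathcal{C}_{n_1}$ such that $\theta^*_{ij} = \theta^*_i$ for
every $i$ and $j$. If we knew this fact about $\boldsymbol{\theta}^*$, then the most natural way of
estimating it would be to perform vector isotonic estimation based on the
row-averages $\bar{y} := (\bar{y}_1, \dots, \bar{y}_{n_1})$, where $\bar{y}_i := \sum_{j=1}^{n_2} y_{ij}/n_2$, resulting in an
estimator $\check{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}^*$. Using the vector isotonic risk bounds (2) and (3), it is
easy to see then that the risk of $\check{\boldsymbol{\theta}}$ has the following pair of bounds:

$$R(\boldsymbol{\theta}^*, \check{\boldsymbol{\theta}}) \le C \left\{ \left( \frac{\sigma^2 \sqrt{D(\theta^*)}}{n} \right)^{2/3} + \frac{\sigma^2 \log n_1}{n} \right\} \tag{11}$$

and

$$R(\boldsymbol{\theta}^*, \check{\boldsymbol{\theta}}) \le \inf_{\theta \in \mathcal{C}_{n_1}} \left( \frac{\|\theta^* - \theta\|^2}{n_1} + \frac{\sigma^2 k(\theta)}{n} \log \frac{e n_1}{k(\theta)} \right). \tag{12}$$

Note that the construction of $\check{\boldsymbol{\theta}}$ requires the knowledge that all rows of $\boldsymbol{\theta}^*$ are
constant. As a consequence of the adaptive risk bound (10), we shall show in
Theorem 2.4 that the matrix isotonic LSE $\hat{\boldsymbol{\theta}}$ achieves the same risk bounds as
$\check{\boldsymbol{\theta}}$, up to additional logarithmic factors. This is remarkable because $\hat{\boldsymbol{\theta}}$ uses no
special knowledge on $\boldsymbol{\theta}^*$; it automatically adapts to the additional structure
present in $\boldsymbol{\theta}^*$.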


Note that in the connection between matrix isotonic estimation and bivariate
isotonic regression, the assumption that $\theta^*_{ij} = f(i/n_1, j/n_2)$ does not
depend on $j$ is equivalent to assuming that $f$ does not depend on its second
variable. Thus, when estimating a bivariate isotonic regression function that
only depends on one variable, the LSE automatically adapts and we get risk
bounds that correspond to estimating a monotone function of one variable.
This is the reason why we refer to this phenomenon as variable adaptation.
To the best of our knowledge, such a result on automatic variable adaptation
in multivariate nonparametric regression is very rare; most nonparametric
regression techniques (e.g., kernel smoothing, splines) do not exhibit such
automatic adaptation properties.


The proof techniques employed in this paper are quite different from the
case of vector isotonic estimation. In the vector problem (1), the LSE has
the closed form expression (see e.g., Robertson et al. [19, Chapter 1]):

$$\hat{\theta}_j = \min_{v \ge j} \max_{u \le j} \frac{1}{v - u + 1} \sum_{i=u}^{v} y_i. \tag{13}$$
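As a small sanity check of the min-max representation of the vector LSE, the sketch below evaluates the formula by brute force and compares it with the pool-adjacent-violators algorithm, a standard way of computing the same projection. The function names `pava` and `minmax_fit` are hypothetical, and the brute-force version is cubic in $n$, so it is intended only for illustration.

```python
def pava(y):
    """Least squares fit of a nondecreasing sequence to y (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


def minmax_fit(y):
    """Min-max formula evaluated directly: min over v >= j of max over u <= j."""
    n = len(y)
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)

    def mean(u, v):  # average of y[u..v], 0-based and inclusive
        return (prefix[v + 1] - prefix[u]) / (v - u + 1)

    return [min(max(mean(u, v) for u in range(j + 1)) for v in range(j, n))
            for j in range(n)]
```

On the sequence $(3, 1, 2, 5, 4)$ both routines give the fit $(2, 2, 2, 4.5, 4.5)$.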

This expression, along with some martingale maximal inequalities, is crucially
used for the proofs of inequalities (2) and (3); see e.g., Zhang [26]
and Chatterjee et al. [7]. The LSE (6) in the matrix estimation problem also
has a closed form expression similar to (13):

$$\hat{\theta}_{ij} = \min_{L \in \mathcal{L} : (i,j) \in L} \; \max_{U \in \mathcal{U} : (i,j) \in U} \bar{y}_{L \cap U} \tag{14}$$

where $\mathcal{L}$ and $\mathcal{U}$ denote the collections of all lower sets and upper sets
respectively and $\bar{y}_A$ is the average of $\{y_{ij} : (i,j) \in A\}$; see Robertson et al. [19,
Chapter 1] for the definitions of upper and lower sets and for a proof of (14).
This unfortunately is a much more complicated expression to directly work
with compared to (13). It is not clear to us if simple martingale techniques
can be used in conjunction with the expression (14) to prove risk bounds for
the LSE.

We therefore abandon the direct approach based on the expression (14) and 
instead resort to general techniques for LSEs in order to prove our results. 
Specifically, we use the standard empirical process based approach to prove 
the worst case bound (7). This approach relies on metric entropy calculations 
of the space of isotonic matrices. Metric entropy results for classes of isotonic 
matrices can be derived from those of bivariate coordinate-wise nondecreasing 
functions. However, existing metric entropy results for classes of bivariate
nondecreasing functions (as in Gao and Wellner [12]) require the functions
to be uniformly bounded. For this reason, these results are not directly
applicable to our setting. We suitably extend these results in order to allow
for the lack of a uniform bound. On the other hand, for the adaptive risk 
bound (10), we use connections between the risk of LSEs and size measures of 
tangent cones. Thus, our proofs borrow ideas from empirical process theory, 
approximation theory and convex geometry and are of independent interest. 

The rest of the paper is organized as follows. Our results are described in
Section 2: Subsection 2.1 deals with the worst case risk bounds while Subsection
2.2 focuses on the adaptive bounds. In Section 3, we provide the necessary
background on the general theory of the LSEs, prove our main metric
entropy results and present the proof of our main worst case upper bound. In
Section 4, we discuss connections between risk of LSEs and appropriate size
measures of tangent cones, and also present the proof of our adaptive risk
bounds. Additional discussion is provided in Section 5. We have also included
an Appendix which contains the proofs of certain auxiliary technical results
used in the paper.


2. Main Results

In this section we give risk bounds on the performance of the isotonic LSE $\hat{\boldsymbol{\theta}}$,
defined in (6). We start with a generalization of (2) and then proceed to
exhibit the adaptive risk behavior of $\hat{\boldsymbol{\theta}}$. We end this section with a result on the
variable adaptation property of the LSE which shows that $\hat{\boldsymbol{\theta}}$ automatically
adapts to the intrinsic dimension of the problem.


2.1. Worst case risk bounds 

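Our first main result establishes inequality (7) which gives an upper bound
on the worst case risk of the matrix isotonic LSE $\hat{\boldsymbol{\theta}}$. We will actually prove a
slightly stronger bound than that given by inequality (7). We first need some
notation. We define the variance of a matrix $\boldsymbol{\theta}$ as

$$V(\boldsymbol{\theta}) := \frac{1}{n} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (\theta_{ij} - \bar{\theta})^2, \tag{15}$$

where $\bar{\theta} = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \theta_{ij}/n$ is the mean of the entries of $\boldsymbol{\theta}$. Note that
$V(\boldsymbol{\theta}) \le D(\boldsymbol{\theta})$ for every $\boldsymbol{\theta} \in \mathcal{M}$. We also denote the set $\{1, \dots, l\}$ by $[l]$ for positive
integers $l$.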
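The comparison between the variance and the squared range of an isotonic matrix follows because every entry, and hence the mean, lies between the smallest entry $\theta_{11}$ and the largest entry $\theta_{n_1 n_2}$. A minimal numerical sketch, with hypothetical function names `variance` and `squared_range`:

```python
def variance(theta):
    """V(theta): average squared deviation of the entries from their mean."""
    n = len(theta) * len(theta[0])
    mean = sum(sum(row) for row in theta) / n
    return sum((v - mean) ** 2 for row in theta for v in row) / n


def squared_range(theta):
    """D(theta): squared gap between the last and the first entry."""
    return (theta[-1][-1] - theta[0][0]) ** 2
```

For a constant matrix both quantities vanish, which is the case in which the worst case bound reduces to the parametric rate.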

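The following theorem, proved in Section 3.3, gives an upper bound on the
risk $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ in terms of the quantity $V(\boldsymbol{\theta}^*)$. Because $V(\boldsymbol{\theta}^*) \le D(\boldsymbol{\theta}^*)$, the
conclusion of the theorem is stronger than inequality (7).

Theorem 2.1. There exists a universal positive constant $C$ such that for
every $n_1, n_2 > 1$ with $n = n_1 n_2$ and $\boldsymbol{\theta}^* \in \mathcal{M}$,

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le C \left( \frac{\sigma^2}{n} (\log n)^8 + \sqrt{\frac{\sigma^2 V(\boldsymbol{\theta}^*)}{n}} (\log n)^4 \right).$$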

Ignoring constants and logarithmic factors, Theorem 2.1 states that the risk
of the LSE at $\boldsymbol{\theta}^*$ converges to zero at the rate $n^{-1/2}$ as long as $V(\boldsymbol{\theta}^*)$ is
bounded away from zero. In the next result, proved in Appendix A.4, we
argue that $n^{-1/2}$ is also a minimax lower bound in this problem. This implies
that the rate $n^{-1/2}$ cannot be improved by any other estimator uniformly
over the class $\{\boldsymbol{\theta}^* : V(\boldsymbol{\theta}^*) \le V\}$ for every constant $V$.
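
Theorem 2.2. For every positive real number $D$,

$$\inf_{\tilde{\boldsymbol{\theta}}} \sup_{\boldsymbol{\theta} \in \mathcal{M} : D(\boldsymbol{\theta}) \le D} R(\boldsymbol{\theta}, \tilde{\boldsymbol{\theta}}) \ge \sqrt{\frac{\sigma^2 D}{192 n}} \tag{16}$$

where the infimum is over all estimators $\tilde{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$, provided the integers $n_1 > 1$,
$n_2 > 1$ with $n = n_1 n_2$ satisfy $n \ge 9\sigma^2/D$ and

$$\min \left( \frac{n_1^3}{n_2}, \frac{n_2^3}{n_1} \right) \ge \frac{D}{9 \sigma^2}. \tag{17}$$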

Remark 2.1. The condition (17) is necessary to ensure that neither $n_1$ nor
$n_2$ is too small. Indeed, the inequality (16) is not true when, for example,
$n_1 = 1$, $n_2 = n$, because in this case the problem reduces to vector isotonic
estimation where the minimax risk is of the order $n^{-2/3} < n^{-1/2}$. When
$n_1 = n_2 = \sqrt{n}$, the inequality (17) is equivalent to $n \ge D/(9\sigma^2)$ which is
satisfied for all large $n$.

Remark 2.2. Recall the quantity $V(\boldsymbol{\theta})$ defined in (15). Because $V(\boldsymbol{\theta}) \le D(\boldsymbol{\theta})$,
it follows that $\{\boldsymbol{\theta} : D(\boldsymbol{\theta}) \le D\} \subseteq \{\boldsymbol{\theta} : V(\boldsymbol{\theta}) \le D\}$. Therefore the
bound (16) also holds if $\{\boldsymbol{\theta} : D(\boldsymbol{\theta}) \le D\}$ is replaced by the larger set
$\{\boldsymbol{\theta} : V(\boldsymbol{\theta}) \le D\}$.


In addition to proving that the LSE is minimax optimal up to logarithmic
factors, another interesting aspect of Theorem 2.1 is that when $V(\boldsymbol{\theta}^*) = 0$,
the upper bound on $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ becomes the parametric rate $\sigma^2/n$ up to a
logarithmic factor. This rate is faster than the worst case rate $n^{-1/2}$. Thus
the LSE adapts to $\boldsymbol{\theta}^* \in \{\boldsymbol{\theta} : V(\boldsymbol{\theta}) = 0\}$. A more detailed description of the
adaptation properties of the LSE is provided in the next theorem.


2.2. Adaptive risk bounds 

The adaptation properties of the matrix isotonic LSE are more subtle
compared to the vector case. In the latter case, adaptation of the LSE is
described by inequality (3). The most natural analogue of (3) in the matrix
case is an inequality of the form (9). Unfortunately it turns out that this
inequality cannot be true for every $\boldsymbol{\theta}^* \in \mathcal{M}$ because it contradicts the minimax
lower bound proved in Theorem 2.2. The reason for this is the following. Fix
$\boldsymbol{\theta}^* = (\theta^*_{ij}) \in \mathcal{M}$ with $D := D(\boldsymbol{\theta}^*) = (\theta^*_{n_1 n_2} - \theta^*_{11})^2 > 0$. Now fix $c \ge 1$ and
define $\boldsymbol{\theta} = (\theta_{ij})$ by

$$\theta_{ij} := \theta^*_{11} + \frac{\sqrt{D}}{c} \left\lfloor \frac{c (\theta^*_{ij} - \theta^*_{11})}{\sqrt{D}} \right\rfloor.$$

It is easy to see that $\boldsymbol{\theta} \in \mathcal{M}$ (because $\theta_{ij}$ is a nondecreasing function of
$\theta^*_{ij}$). Also for every $i, j$, we have $\theta^*_{ij} - \sqrt{D}/c \le \theta_{ij} \le \theta^*_{ij}$ which implies that
$\|\boldsymbol{\theta} - \boldsymbol{\theta}^*\|^2 \le nD/c^2$. Finally $c(\boldsymbol{\theta}) \le c + 1$. Therefore if inequality (9) were
true for every $\boldsymbol{\theta}^*$, we would obtain

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le p(\log n) \inf_{c \ge 1} \left( \frac{D}{c^2} + \frac{c \sigma^2}{n} + \frac{\sigma^2}{n} \right).$$

Choosing $c = \lceil (nD/\sigma^2)^{1/3} \rceil$, we would obtain that $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ converges to
zero at the $n^{-2/3}$ rate. This obviously contradicts the minimax lower bound
proved in Theorem 2.2. Therefore, one cannot hope to prove an inequality of
the form (9) for every $\boldsymbol{\theta}^* \in \mathcal{M}$.
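The discretization used in this argument is easy to check numerically. The sketch below rounds each entry of an isotonic matrix down to a grid of mesh $\sqrt{D}/c$ above its smallest entry, and the accompanying helpers verify the three properties used above: the rounded matrix is isotonic, it has at most $c + 1$ distinct values, and its squared Frobenius distance to the original is at most $nD/c^2$. The function names and the test matrix are hypothetical illustrations, and $c$ is taken to be a positive integer.

```python
import math


def discretize(theta_star, c):
    """Round an isotonic matrix down to a grid of mesh sqrt(D)/c above its smallest entry."""
    low = theta_star[0][0]
    root_d = theta_star[-1][-1] - low  # sqrt(D(theta*)), assumed strictly positive
    return [[low + (root_d / c) * math.floor(c * (v - low) / root_d) for v in row]
            for row in theta_star]


def is_isotonic(theta):
    """True if theta is nondecreasing along every row and every column."""
    n1, n2 = len(theta), len(theta[0])
    rows_ok = all(theta[i][j] <= theta[i][j + 1]
                  for i in range(n1) for j in range(n2 - 1))
    cols_ok = all(theta[i][j] <= theta[i + 1][j]
                  for i in range(n1 - 1) for j in range(n2))
    return rows_ok and cols_ok


def squared_distance(a, b):
    """Squared Frobenius distance between two matrices of the same shape."""
    return sum((x - z) ** 2 for ra, rb in zip(a, b) for x, z in zip(ra, rb))
```

For instance, with the $6 \times 6$ matrix whose $(i,j)$ entry is $(i + j)/10$ for $i, j = 0, \dots, 5$ (so that $D = 1$) and $c = 4$, the rounded matrix takes only the five values $0, 0.25, 0.5, 0.75, 1$.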

The fact that inequality (9) is false means that the LSE $\hat{\boldsymbol{\theta}}$ does not adapt to
every $\boldsymbol{\theta}^* \in \mathcal{M}$ with small $c(\boldsymbol{\theta}^*)$. However, inequality (9) can be proved for
every $\boldsymbol{\theta}^* \in \mathcal{M}$ if the quantity $c(\boldsymbol{\theta})$ is replaced by the larger quantity $k(\boldsymbol{\theta})$
(the number of rectangular blocks) as defined in the Introduction. We
are now ready to state our main adaptive risk bound for the matrix LSE; see
Section 4.2 for its proof.


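Theorem 2.3. There exists a universal constant $C > 0$ such that for every
$\boldsymbol{\theta}^* \in \mathcal{M}$ we have

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le \inf_{\boldsymbol{\theta} \in \mathcal{M}} \left( \frac{\|\boldsymbol{\theta}^* - \boldsymbol{\theta}\|^2}{n} + \frac{C k(\boldsymbol{\theta}) \sigma^2}{n} (\log n)^8 \right). \tag{18}$$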


Remark 2.3. Note that $1 \le k(\boldsymbol{\theta}) \le n$ for all $\boldsymbol{\theta} \in \mathcal{M}$. There exist $\boldsymbol{\theta} \in \mathcal{M}$
for which $c(\boldsymbol{\theta}) = k(\boldsymbol{\theta})$. These are elements $\boldsymbol{\theta} \in \mathcal{M}$ whose level sets (level sets
of $\boldsymbol{\theta}$ are non-empty sets of the form $\{(i,j) : \theta_{ij} = a\}$ for some real number
$a$) are all rectangular.

Remark 2.4. A simple consequence of Theorem 2.3 is that $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ is
bounded by the parametric rate (up to logarithmic factors) when $k(\boldsymbol{\theta}^*)$ is
bounded from above by a constant. To see this, simply note that we can take
$\boldsymbol{\theta} = \boldsymbol{\theta}^*$ in (18) to obtain

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le C (\log n)^8 \frac{k(\boldsymbol{\theta}^*) \sigma^2}{n}.$$

The right hand side above is just the parametric rate $\sigma^2/n$ up to logarithmic
factors provided $k(\boldsymbol{\theta}^*)$ is bounded by a constant (or a logarithmic factor of
$n$).

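Remark 2.5. Inequality (18) sometimes gives near parametric bounds for
$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ even when $k(\boldsymbol{\theta}^*) = n$. This happens when $\boldsymbol{\theta}^*$ is well approximated
by some $\boldsymbol{\theta} \in \mathcal{M}$ with small $k(\boldsymbol{\theta})$. An example of this is given below: Assume,
for simplicity, that $n_1 = n_2 = \sqrt{n} = 2^k$ for some positive integer $k$. Define
$\boldsymbol{\theta}^* \in \mathbb{R}^{n_1 \times n_2}$ by

$$\theta^*_{ij} = -(2^{-i} + 2^{-j}) \qquad \text{for } 1 \le i, j \le n_1.$$

It should then be clear that $\boldsymbol{\theta}^* \in \mathcal{M}$ and $k(\boldsymbol{\theta}^*) = n$. Also, let us define
$\boldsymbol{\theta} \in \mathcal{M}$ by

$$\theta_{ij} = -(2^{-(i \wedge k)} + 2^{-(j \wedge k)}) \qquad \text{for } 1 \le i, j \le n_1,$$

where $a \wedge b := \min(a, b)$. Observe that $k(\boldsymbol{\theta}) \le (k+1)^2 \le C (\log n)^2$, since
$k = \log_2 \sqrt{n}$. Further

$$\frac{1}{n} \|\boldsymbol{\theta} - \boldsymbol{\theta}^*\|^2 \le \max_{(i,j)} (\theta_{ij} - \theta^*_{ij})^2 \le 2 \left( 2^{-2k} + 2^{-2k} \right) = \frac{4}{n}.$$

Theorem 2.3 therefore gives

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le C \left\{ \frac{1}{n} + \frac{\sigma^2}{n} (\log n)^{10} \right\}.$$

This is the parametric bound up to logarithmic factors in $n$.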
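The example in this remark can be reproduced numerically for small $k$. The sketch below builds both matrices, measures the largest squared entrywise gap, and counts the cells of the rectangular partition induced by the pairs $(i \wedge k, j \wedge k)$, on which the approximating matrix is constant; that count is an upper bound on its number of rectangular blocks. The function names are hypothetical.

```python
def remark_example(k):
    """Build theta* and its approximation theta for n1 = n2 = 2**k."""
    n1 = 2 ** k
    idx = range(1, n1 + 1)
    theta_star = [[-(2.0 ** -i + 2.0 ** -j) for j in idx] for i in idx]
    theta = [[-(2.0 ** -min(i, k) + 2.0 ** -min(j, k)) for j in idx] for i in idx]
    return theta_star, theta


def max_squared_error(a, b):
    """Largest squared entrywise difference between two matrices."""
    return max((x - z) ** 2 for ra, rb in zip(a, b) for x, z in zip(ra, rb))


def block_count(k):
    """Number of cells of the partition induced by (min(i, k), min(j, k))."""
    n1 = 2 ** k
    return len({(min(i, k), min(j, k))
                for i in range(1, n1 + 1) for j in range(1, n1 + 1)})
```

With $k = 3$ (so $n = 64$) the partition has $9$ cells, while the largest squared gap stays below $4/n = 0.0625$.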


2.3. Variable adaptation 

In this sub-section we describe a very interesting property of the LSE which
shows that $\hat{\boldsymbol{\theta}}$ adapts to the intrinsic dimension of the problem. Suppose that
$\boldsymbol{\theta}^* \in \mathcal{M}$ is such that its value does not depend on the columns, i.e., there
exists $\theta^* \in \mathcal{C}_{n_1}$ (recall that $\mathcal{C}_{n_1} = \{\theta \in \mathbb{R}^{n_1} : \theta_1 \le \dots \le \theta_{n_1}\}$) such that
$\theta^*_{ij} = \theta^*_i$ for every $i$ and $j$. Note that in connection to bivariate isotonic
regression, the assumption that $\theta^*_{ij} := f(i/n_1, j/n_2)$ does not depend on $j$
is equivalent to assuming that $f$ does not depend on its second variable.
If we knew this fact about $\boldsymbol{\theta}^*$, then the most natural way of estimating it
would be to perform vector isotonic estimation based on the row-averages
$\bar{y} := (\bar{y}_1, \dots, \bar{y}_{n_1})$, where $\bar{y}_i := \sum_{j=1}^{n_2} y_{ij}/n_2$, resulting in an estimator $\check{\boldsymbol{\theta}}$ of
$\boldsymbol{\theta}^*$. This oracle estimator has risk bounds given in (11) and (12).


imsart-generic ver. 2011/11/15 file: IsoMatrixEstBern.tex date: November 3, 2015


Chatterjee, S., Guntuboyina, A. and Sen, B./Matrix estimation under monotonicity 13


The following theorem, proved in Section 4.3, shows that the matrix isotonic
LSE $\hat{\boldsymbol{\theta}}$ achieves the same risk bounds as $\check{\boldsymbol{\theta}}$, up to additional multiplicative
logarithmic factors. This is remarkable because $\hat{\boldsymbol{\theta}}$ uses no special knowledge
on $\boldsymbol{\theta}^*$; it automatically adapts to the additional structure present in $\boldsymbol{\theta}^*$. Thus,
when estimating a bivariate isotonic regression function that only depends
on one variable, the LSE automatically adapts and we get risk bounds that
correspond to estimating a monotone function in one variable. As mentioned
in the Introduction such a result on automatic variable adaptation in
multivariate nonparametric regression is very rare.
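For concreteness, the oracle estimator described above (row averages followed by vector isotonic regression, with the fitted value repeated along each row) can be sketched as follows. The pool-adjacent-violators routine is a standard way to compute the vector LSE; the function names `pava` and `oracle_row_estimator` are hypothetical.

```python
def pava(y):
    """Least squares fit of a nondecreasing sequence to y (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


def oracle_row_estimator(y):
    """Isotonic regression of the row averages, repeated across the columns."""
    n2 = len(y[0])
    row_means = [sum(row) / n2 for row in y]
    return [[value] * n2 for value in pava(row_means)]
```

The sketch uses the knowledge that the target depends only on the row index, which is exactly the information the matrix LSE does not have.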

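Theorem 2.4. Suppose $\boldsymbol{\theta}^* = (\theta^*_{ij}) \in \mathcal{M}$ and $\theta^* = (\theta^*_i) \in \mathcal{C}_{n_1}$ are such that
$\theta^*_{ij} = \theta^*_i$ for all $1 \le i \le n_1$ and $1 \le j \le n_2$. Then the following pair of
inequalities hold for a universal positive constant $C$:

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le \inf_{\theta \in \mathcal{C}_{n_1}} \left( \frac{\|\theta^* - \theta\|^2}{n_1} + \frac{C k(\theta) \sigma^2}{n} (\log n)^8 \right) \tag{19}$$

and

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le C (\log n)^8 \left( \frac{\sigma^2 \sqrt{D(\theta^*)}}{n} \right)^{2/3} \qquad \text{provided } n D(\theta^*) \ge 2\sigma^2. \tag{20}$$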


3. General theory of LSEs, Metric Entropy Calculations and the 
proof of Theorem 2.1 

This section is mainly devoted to the proof of Theorem 2.1. The general 
theory of LSEs under convex constraints is crucially used to prove Theorem 
2.1. Parts of this general theory that are relevant to the proof of Theorem 2.1 
are recalled in the next subsection. Essentially, this general theory reduces the 
problem of bounding $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ to certain metric entropy calculations of classes
of isotonic matrices. In Subsection 3.2, we prove such results by extending 
appropriately existing metric entropy results for bivariate coordinate-wise 
nondecreasing functions due to Gao and Wellner [12]. Finally, in Subsection 
3.3, we complete the proof of Theorem 2.1 by combining the metric entropy 
results with general results on LSEs. 




3.1. General Theory of LSEs 


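The following result due to Chatterjee [9, Corollary 1.2] is a key technical tool
for the proof of Theorem 2.1. It reduces the problem of bounding $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$
to controlling the maximizer of an appropriate Gaussian process.

Theorem 3.1 (Chatterjee). Fix $\boldsymbol{\theta}^* \in \mathcal{M}$. Let us define the function
$f_{\boldsymbol{\theta}^*} : \mathbb{R}_+ \to \mathbb{R}$ as

$$f_{\boldsymbol{\theta}^*}(t) := E \left( \sup_{\boldsymbol{\theta} \in \mathcal{M} : \|\boldsymbol{\theta}^* - \boldsymbol{\theta}\| \le t} \; \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \epsilon_{ij} (\theta_{ij} - \theta^*_{ij}) \right) - \frac{t^2}{2}. \tag{21}$$

Let $t_{\boldsymbol{\theta}^*}$ be the point in $[0, \infty)$ where $t \mapsto f_{\boldsymbol{\theta}^*}(t)$ attains its maximum (existence
and uniqueness of $t_{\boldsymbol{\theta}^*}$ are proved in Chatterjee [9, Theorem 1.1]). Then there
exists a universal positive constant $C$ such that

$$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}}) \le \frac{C}{n} \max \left( t_{\boldsymbol{\theta}^*}^2, \sigma^2 \right). \tag{22}$$

The above theorem reduces the problem of bounding $R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ to that of
bounding $t_{\boldsymbol{\theta}^*}$. For this latter problem, Chatterjee [9, Proposition 1.3] observed
that

$$t_{\boldsymbol{\theta}^*} \le t_{**} \qquad \text{whenever } t_{**} > 0 \text{ and } f_{\boldsymbol{\theta}^*}(t_{**}) \le 0.$$

In order to bound $t_{\boldsymbol{\theta}^*}$, one therefore seeks $t_{**} > 0$ such that $f_{\boldsymbol{\theta}^*}(t_{**}) \le 0$. This
now requires a bound on the expected supremum of the Gaussian process in
the definition of $f_{\boldsymbol{\theta}^*}(t)$ in (21).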

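It will be convenient below to have the following notation. For $n_1 \times n_2$ matrices
$\boldsymbol{M}, \boldsymbol{N} \in \mathbb{R}^{n_1 \times n_2}$, let $\|\boldsymbol{M} - \boldsymbol{N}\|$ denote the Frobenius distance between $\boldsymbol{M}$
and $\boldsymbol{N}$ defined by

$$\|\boldsymbol{M} - \boldsymbol{N}\|^2 := \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (M_{ij} - N_{ij})^2.$$

For a subset $\mathcal{F} \subseteq \mathbb{R}^{n_1 \times n_2}$ and $\epsilon > 0$, let $N(\epsilon, \mathcal{F})$ denote the $\epsilon$-covering number
of $\mathcal{F}$ under the Frobenius metric $\|\cdot\|$ (i.e., $N(\epsilon, \mathcal{F})$ is the minimum number
of balls of radius $\epsilon$ required to cover $\mathcal{F}$). Also, for each $\boldsymbol{\theta}^* \in \mathcal{M}$ and $t > 0$,
let

$$B(\boldsymbol{\theta}^*, t) := \{\boldsymbol{\theta} \in \mathcal{M} : \|\boldsymbol{\theta} - \boldsymbol{\theta}^*\| \le t\} \tag{23}$$

denote the ball of radius $t$ around $\boldsymbol{\theta}^*$. Observe that the supremum in the
definition of (21) is over all $\boldsymbol{\theta} \in B(\boldsymbol{\theta}^*, t)$. Finally let

$$\langle \boldsymbol{\epsilon}, \boldsymbol{\theta} - \boldsymbol{\theta}^* \rangle := \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \epsilon_{ij} (\theta_{ij} - \theta^*_{ij}).$$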

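The following chaining result gives an upper bound on the expected suprema
of the above Gaussian process (see e.g., van de Geer [22]); see [8] for a proof.

Theorem 3.2 (Chaining). For every $\boldsymbol{\theta}^* \in \mathcal{M}$ and $t > 0$,

$$E \left[ \sup_{\boldsymbol{\theta} \in B(\boldsymbol{\theta}^*, t)} \langle \boldsymbol{\epsilon}, \boldsymbol{\theta} - \boldsymbol{\theta}^* \rangle \right] \le \sigma \inf_{0 < \delta \le 2t} \left\{ 12 \int_{\delta}^{2t} \sqrt{\log N(\epsilon, B(\boldsymbol{\theta}^*, t))} \, d\epsilon + 4 \delta \sqrt{n} \right\}.$$

The general results outlined here essentially reduce the problem of bounding
$R(\boldsymbol{\theta}^*, \hat{\boldsymbol{\theta}})$ to controlling the metric entropy of subsets of $\mathcal{M}$ of the form
$B(\boldsymbol{\theta}^*, t)$. Such a metric entropy bound is proved in the next subsection. This
is the key technical component in the proof of Theorem 2.1.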


3.2. Main metric entropy result 

Let $\boldsymbol{0}$ denote the $n_1 \times n_2$ matrix all of whose entries are equal to 0. According
to the notation (23), we have

$$B(\boldsymbol{0}, 1) = \{\boldsymbol{\theta} \in \mathcal{M} : \|\boldsymbol{\theta} - \boldsymbol{0}\| \le 1\} = \Big\{ \boldsymbol{\theta} \in \mathcal{M} : \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \theta_{ij}^2 \le 1 \Big\}. \tag{24}$$

The next theorem gives an upper bound on the $\epsilon$-covering number of $B(\boldsymbol{0}, 1)$
(all covering numbers will be with respect to the Frobenius metric $\|\cdot\|$). It
will be crucially used in our proof of Theorem 2.1.

Theorem 3.3. There exists a universal positive constant $C$ such that the
following inequality holds for every $\epsilon > 0$ and integers $n_1, n_2 > 1$:

$$\log N(\epsilon, B(\boldsymbol{0}, 1)) \le C \, \frac{(\log n_1)^2 (\log n_2)^2}{\epsilon^2} \left( \log \frac{4 \sqrt{\log n_1 \log n_2}}{\epsilon} \right)^2. \tag{25}$$

Moreover for every $0 < \delta \le 1$,

$$\int_{\delta}^{1} \sqrt{\log N(\epsilon, B(\boldsymbol{0}, 1))} \, d\epsilon \le C (\log n_1)(\log n_2) \left( \log \frac{4 \sqrt{\log n_1 \log n_2}}{\delta} \right)^2. \tag{26}$$

There is a close connection between metric entropy results for isotonic
matrices and those for bivariate coordinate-wise nondecreasing functions. Indeed,
for every isotonic matrix $\boldsymbol{\theta}$, we can associate a bivariate coordinate-wise
nondecreasing function $\phi_{\boldsymbol{\theta}} : [0,1]^2 \to \mathbb{R}$ via

$$\phi_{\boldsymbol{\theta}}(x_1, x_2) := \min \{\theta_{ij} : n_1 x_1 \le i \le n_1, \; n_2 x_2 \le j \le n_2\}$$

for all $(x_1, x_2) \in [0,1]^2$. It can then be directly verified that

$$\|\boldsymbol{\theta} - \boldsymbol{\nu}\|^2 = n \int_0^1 \int_0^1 \left( \phi_{\boldsymbol{\theta}}(x_1, x_2) - \phi_{\boldsymbol{\nu}}(x_1, x_2) \right)^2 dx_1 \, dx_2$$

for every pair $\boldsymbol{\theta}, \boldsymbol{\nu}$ of isotonic matrices. This means that metric entropy
results for classes of isotonic matrices can be derived from those of bivariate
nondecreasing functions. However, existing metric entropy results for classes
of bivariate nondecreasing functions (see Gao and Wellner [12]) require the
functions to be uniformly bounded. If the average constraint in the definition
(24) of $B(\boldsymbol{0}, 1)$ is replaced by a supremum constraint, i.e., if one considers the
smaller set $B_\infty(\boldsymbol{0}, n^{-1/2}) := \{\boldsymbol{\theta} \in \mathcal{M} : \sup_{1 \le i \le n_1, 1 \le j \le n_2} |\theta_{ij}| \le n^{-1/2}\}$, then
the metric entropy of $B_\infty(\boldsymbol{0}, n^{-1/2})$ can be easily controlled via the results of
Gao and Wellner [12]. This is the content of the following lemma where we
actually consider the classes

$$B_\infty(\boldsymbol{0}, t) := \Big\{ \boldsymbol{\theta} \in \mathcal{M} : \sup_{1 \le i \le n_1, 1 \le j \le n_2} |\theta_{ij}| \le t \Big\}$$

for general $t > 0$.
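The identity relating the Frobenius distance between two isotonic matrices to the $L_2$ distance between their associated step functions can be verified numerically. In the sketch below, `phi` implements the definition of $\phi_{\boldsymbol{\theta}}$ literally, and `scaled_l2_distance` integrates exactly by evaluating at the midpoint of each grid cell, on which both step functions are constant. The function names and the test matrices are hypothetical.

```python
def phi(theta, x1, x2):
    """phi_theta(x1, x2) = min{theta_ij : n1*x1 <= i <= n1, n2*x2 <= j <= n2} (1-based i, j)."""
    n1, n2 = len(theta), len(theta[0])
    return min(theta[i - 1][j - 1]
               for i in range(1, n1 + 1) if i >= n1 * x1
               for j in range(1, n2 + 1) if j >= n2 * x2)


def scaled_l2_distance(theta, nu):
    """n times the integral of (phi_theta - phi_nu)^2 over the unit square.

    Both step functions are constant on each cell
    ((i-1)/n1, i/n1] x ((j-1)/n2, j/n2], so midpoint evaluation weighted by
    the cell area 1/n is exact; multiplying by n cancels the weight.
    """
    n1, n2 = len(theta), len(theta[0])
    total = 0.0
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            x1, x2 = (i - 0.5) / n1, (j - 0.5) / n2
            total += (phi(theta, x1, x2) - phi(nu, x1, x2)) ** 2
    return total


def squared_frobenius(a, b):
    """Squared Frobenius distance between two matrices of the same shape."""
    return sum((x - z) ** 2 for ra, rb in zip(a, b) for x, z in zip(ra, rb))
```

For isotonic inputs the minimum in `phi` is attained at the smallest admissible indices, so the step function takes the value $\theta_{ij}$ on the $(i,j)$'th cell, which is why the two distances agree.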

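Lemma 3.4. There exists a universal positive constant $C$ such that

$$\log N(\epsilon, B_\infty(\boldsymbol{0}, t)) \le C \left( \frac{t \sqrt{n}}{\epsilon} \right)^2 \left( \log \frac{t \sqrt{n}}{\epsilon} \right)^2$$

for every $t > 0$ and $\epsilon > 0$.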


Lemma 3.4 does not automatically imply Theorem 3.3 simply because the
class $B(\boldsymbol{0}, 1)$ is much larger than $B_\infty(\boldsymbol{0}, n^{-1/2})$. Nevertheless, it turns out that
the entries $\theta_{ij}$ of a matrix $\boldsymbol{\theta}$ in $B(\boldsymbol{0}, 1)$ are bounded provided $\min(i - 1, n_1 - i)$
and $\min(j - 1, n_2 - j)$ are not too small. This is the content of Lemma 3.5
given below.

Lemma 3.5. The following holds for every $\boldsymbol{\theta} \in B(\boldsymbol{0}, 1)$ and $1 \le i \le n_1$,
$1 \le j \le n_2$:



(27) 


Using Lemma 3.5, we employ a peeling-type argument to prove Theorem
3.3 where we partition the entries of the matrix $\boldsymbol{\theta}$ into various subrectangles
and use Lemma 3.4 in each subrectangle. The complete proof of Theorem 3.3
along with the proofs of Lemma 3.4 and Lemma 3.5 are given in Appendix A.

3.3. Proof of Theorem 2.1 

We provide the proof of Theorem 2.1 here using the results from the last two 
subsections. 

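Fix $\theta^* \in \mathcal{M}$ and let $f_{\theta^*}(\cdot)$ be defined as in (21) with $t_{\theta^*}$ being the point in
$[0, \infty)$ where $t \mapsto f_{\theta^*}(t)$ attains its maximum.

Let $\bar\theta^*$ denote the constant matrix taking the value $\sum_{i,j} \theta^*_{ij}/n$, i.e.,
$\bar\theta^*_{kl} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \theta^*_{ij}/n$ for $1 \le k \le n_1$ and $1 \le l \le n_2$. Writing
$\theta - \theta^* = \theta - \bar\theta^* + \bar\theta^* - \theta^*$, we have

\[ \sup_{\theta \in B(\theta^*, t)} \langle \epsilon, \theta - \theta^* \rangle = \sup_{\theta \in B(\theta^*, t)} \langle \epsilon, \theta - \bar\theta^* \rangle + \langle \epsilon, \bar\theta^* - \theta^* \rangle \]

for every $t \ge 0$. Taking expectations on both sides with respect to $\epsilon$, we
obtain

\[ \mathbb{E} \sup_{\theta \in B(\theta^*, t)} \langle \epsilon, \theta - \theta^* \rangle = \mathbb{E} \sup_{\theta \in B(\theta^*, t)} \langle \epsilon, \theta - \bar\theta^* \rangle. \tag{28} \]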

Now by the triangle inequality, it is easy to see that

\[ B(\theta^*, t) \subseteq B(\bar\theta^*, r_t) \quad \text{where } r_t := t + \sqrt{nV(\theta^*)}. \]


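This and (28) together imply that

\[ f_{\theta^*}(t) \le \mathbb{E} \sup_{\theta \in B(\bar\theta^*, r_t)} \langle \epsilon, \theta - \bar\theta^* \rangle - \frac{t^2}{2}. \]

Because $\bar\theta^*$ is a constant matrix, it is easy to see that

\[ \sup_{\theta \in B(\bar\theta^*, r_t)} \langle \epsilon, \theta - \bar\theta^* \rangle = \sup_{\theta \in B(0, r_t)} \langle \epsilon, \theta \rangle = r_t \sup_{\theta \in B(0,1)} \langle \epsilon, \theta \rangle \]

where $0$ denotes the constant matrix with all entries equal to 0.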

As a consequence, we have

\[ f_{\theta^*}(t) \le r_t \, \mathbb{E} \sup_{\theta \in B(0,1)} \langle \epsilon, \theta \rangle - \frac{t^2}{2} \quad \text{for all } t \ge 0. \tag{29} \]

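We now use Theorem 3.2 with $\delta = 1/\sqrt{n}$ to obtain

\[ \mathbb{E} \sup_{\theta \in B(0,1)} \langle \epsilon, \theta \rangle \le 12\sigma \int_{1/\sqrt{n}}^{2} \sqrt{\log N(\epsilon, B(0,1))} \, d\epsilon + 4\sigma. \]

Inequality (26) with $\delta = n^{-1/2}$ then gives

\[ \mathbb{E} \sup_{\theta \in B(0,1)} \langle \epsilon, \theta \rangle \le C\sigma \left( A (\log(B\sqrt{n}))^2 + 1 \right) \]

with $A := (\log n_1)(\log n_2)$ and $B := 4(\log n_1)(\log n_2)$.

Thus, letting $g(t) := C r_t \sigma \left( A(\log(B\sqrt{n}))^2 + 1 \right)$, we obtain from (29) that

\[ f_{\theta^*}(t) \le g(t) - \frac{t^2}{2} \quad \text{for all } t \ge 0. \]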

It can now be directly verified that 

fe*(t**) < g(t**) - i( t**) 2 < 0 for t** := 2C^ 2 + 1 {nV{6*)) 1 / 2 

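where $\gamma := \sigma \left( A(\log(B\sqrt{n}))^2 + 1 \right)$. Inequality (22) in Theorem 3.1 therefore
gives

\[ R(\theta^*, \hat\theta) \le \frac{C}{n} \max\left( (t^{**})^2, \sigma^2 \right). \tag{30} \]

Now $(t^{**})^2 \le C\left(\gamma^2 + \gamma\sqrt{nV(\theta^*)}\right)$ and using the expressions for $A$ and $B$, it
is easy to see that (note that $n > 1$ because $n_1, n_2 > 1$)

\[ \gamma = \sigma \left( A(\log(B\sqrt{n}))^2 + 1 \right) \le C\sigma (\log n)^4. \]

This, along with (30), allows us to deduce

\[ R(\theta^*, \hat\theta) \le C \left( \frac{\sigma^2}{n} (\log n)^8 + \sqrt{\frac{V(\theta^*)}{n}} \, \sigma (\log n)^4 \right) \]

which proves Theorem 2.1.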

4. Risk, Tangent Cones and the Proofs of Theorems 2.3 and 2.4

This section is devoted to the proofs of Theorem 2.3 and Theorem 2.4. We
use a recent result of Bellec [3] on the connection between the risk $R(\theta^*, \hat\theta)$
and certain size measures of tangent cones to $\mathcal{M}$ at $\theta^*$. This result is recalled
in the next subsection.


4.1. Risk and tangent cones 

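Fix $\theta \in \mathcal{M}$. The tangent cone of $\mathcal{M}$ at $\theta$ will be denoted by $T_{\mathcal{M}}(\theta)$ and is
defined as the closure of the convex cone generated by $u - \theta$ as $u$ varies over
$\mathcal{M}$, i.e.,

\[ T_{\mathcal{M}}(\theta) := \text{closure} \{ \alpha(u - \theta) : \alpha \ge 0 \text{ and } u \in \mathcal{M} \}. \]

The tangent cone $T_{\mathcal{M}}(\theta)$ is a closed, convex subset of $\mathbb{R}^n = \mathbb{R}^{n_1 \times n_2}$. Observe
that if $\theta$ is a constant matrix (i.e., all entries of $\theta$ are the same), then $T_{\mathcal{M}}(\theta)$
is simply equal to $\mathcal{M}$.

It turns out that the risk $R(\theta^*, \hat\theta)$ can be controlled by appropriate size
measures of the tangent cones $T_{\mathcal{M}}(\theta)$, $\theta \in \mathcal{M}$. This is formalized in the
following lemma. This lemma is similar in spirit to results in Oymak and
Hassibi [17]. More general such results involving model misspecification have
recently appeared in Bellec [3].

Let $\epsilon = (\epsilon_{ij})$ denote the $n_1 \times n_2$ matrix all of whose entries are independent
and normally distributed with zero mean and variance $\sigma^2$. The Euclidean
projection of $\epsilon$ onto the tangent cone $T_{\mathcal{M}}(\theta)$ is defined in the usual way as

\[ \Pi(\epsilon, T_{\mathcal{M}}(\theta)) := \mathop{\mathrm{argmin}}_{u \in T_{\mathcal{M}}(\theta)} \|\epsilon - u\|^2. \]

Lemma 4.1. For every $\theta^* \in \mathcal{M}$ we have

\[ R(\theta^*, \hat\theta) \le \frac{1}{n} \inf_{\theta \in \mathcal{M}} \left( \|\theta^* - \theta\|^2 + \mathbb{E}\|\Pi(\epsilon, T_{\mathcal{M}}(\theta))\|^2 \right) \tag{31} \]

where the expectation on the right hand side is with respect to $\epsilon$.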
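
Proof. Recall that $y = \theta^* + \epsilon$ and that $\hat\theta$ is the projection of the data matrix
$y$ onto $\mathcal{M}$. By the usual KKT conditions, this projection $\hat\theta$ satisfies

\[ \langle y - \hat\theta, \hat\theta - \theta \rangle \ge 0 \quad \text{for every } \theta \in \mathcal{M} \]

where $\langle A, B \rangle = \sum_{i,j} a_{ij} b_{ij}$ for $A = (a_{ij})$ and $B = (b_{ij})$. This inequality
implies that

\[ \|y - \theta\|^2 \ge \|y - \hat\theta\|^2 + \|\hat\theta - \theta\|^2 \quad \text{for every } \theta \in \mathcal{M}. \]

Writing $y = \theta^* + \epsilon$, expanding out the squares and rearranging terms, we
obtain

\[ \|\theta^* - \theta\|^2 + \|\epsilon\|^2 + 2\langle \theta^* - \theta, \epsilon \rangle \ge \|\theta^* - \hat\theta\|^2 + \|\epsilon\|^2 + 2\langle \theta^* - \hat\theta, \epsilon \rangle + \|\hat\theta - \theta\|^2, \]

i.e.,

\[ \|\hat\theta - \theta^*\|^2 \le 2\langle \hat\theta - \theta, \epsilon \rangle - \|\hat\theta - \theta\|^2 + \|\theta^* - \theta\|^2, \]

i.e.,

\[ \|\hat\theta - \theta^*\|^2 \le \|\theta^* - \theta\|^2 + \|\epsilon\|^2 - \|\epsilon - (\hat\theta - \theta)\|^2. \]

Because $\hat\theta \in \mathcal{M}$, the matrix $\hat\theta - \theta$ belongs to the tangent cone $T_{\mathcal{M}}(\theta)$. We
therefore get

\[ \|\hat\theta - \theta^*\|^2 \le \|\theta^* - \theta\|^2 + \|\epsilon\|^2 - \inf_{u \in T_{\mathcal{M}}(\theta)} \|\epsilon - u\|^2. \]

The infimum over $u$ above is clearly achieved for $u := \Pi(\epsilon, T_{\mathcal{M}}(\theta))$ and hence

\[ \|\hat\theta - \theta^*\|^2 \le \|\theta^* - \theta\|^2 + \|\epsilon\|^2 - \|\epsilon - \Pi(\epsilon, T_{\mathcal{M}}(\theta))\|^2. \tag{32} \]

Because $T_{\mathcal{M}}(\theta)$ is a closed convex cone, the projection $\Pi(\epsilon, T_{\mathcal{M}}(\theta))$ satisfies
(see, for example, [16, Equation (4)]):

\[ \langle \epsilon - \Pi(\epsilon, T_{\mathcal{M}}(\theta)), \Pi(\epsilon, T_{\mathcal{M}}(\theta)) \rangle = 0. \]

The above equality and inequality (32) together imply that

\[ \|\hat\theta - \theta^*\|^2 \le \|\theta^* - \theta\|^2 + \|\Pi(\epsilon, T_{\mathcal{M}}(\theta))\|^2. \]

The required inequality (31) now follows by taking expectations on both
sides. □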
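The orthogonality property of projections onto closed convex cones used in the proof can be illustrated in a simpler, one-dimensional setting. The sketch below is an assumption-laden illustration rather than the paper's construction: it replaces the matrix tangent cone by the cone of nondecreasing vectors and computes the projection with the pool adjacent violators algorithm (the helper `pava` is written here for illustration only).

```python
import numpy as np


def pava(y):
    """Euclidean projection of y onto {u : u_1 <= ... <= u_n} (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])


rng = np.random.default_rng(1)
eps = rng.normal(size=50)   # pure noise vector
proj = pava(eps)            # projection onto the monotone cone

# <eps - proj, proj> should vanish for a projection onto a closed convex cone.
inner = float(np.dot(eps - proj, proj))
# Consequently ||eps||^2 - ||eps - proj||^2 equals ||proj||^2.
gap = float(eps @ eps - (eps - proj) @ (eps - proj))
```

Here `inner` is zero up to rounding and `gap` matches `proj @ proj`, which is exactly the step that turns (32) into a bound involving only the squared norm of the projection.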

Inequality (31) reduces the problem of bounding the risk to controlling the
expected squared norm of the projection of $\epsilon$ onto the tangent cones $T_{\mathcal{M}}(\theta)$, $\theta \in
\mathcal{M}$. This will be crucially used in the proof of Theorem 2.3.

4.2. Proof of Theorem 2.3

We provide the proof of Theorem 2.3 in this subsection. The first step is
to characterize the tangent cone $T_{\mathcal{M}}(\theta)$ for every $\theta \in \mathcal{M}$. We need some
notation here. For a subset $S$ of $\{(i,j) : 1 \le i \le n_1, 1 \le j \le n_2\}$, let $\mathbb{R}^S$
denote the class of all real-valued functions from $S$ to $\mathbb{R}$. Elements of $\mathbb{R}^S$ will
be denoted by $(\theta_{ij}, (i,j) \in S)$. We say that $(\theta_{ij} : (i,j) \in S)$ is isotonic if

\[ \theta_{ij} \le \theta_{kl} \quad \text{whenever } (i,j), (k,l) \in S \text{ with } i \le k \text{ and } j \le l. \]

The set of such isotonic sequences in $\mathbb{R}^S$ will be denoted by $\mathcal{M}(S)$. Also for
every two dimensional array $\theta = (\theta_{ij} : 1 \le i \le n_1, 1 \le j \le n_2)$, let

\[ \theta(S) := (\theta_{ij} : (i,j) \in S). \]

Observe that $\theta(S) \in \mathcal{M}(S)$ if $\theta \in \mathcal{M}$. The following lemma provides a useful
characterization of $T_{\mathcal{M}}(\theta)$ for $\theta \in \mathcal{M}$. Recall that a rectangular partition of
$[n_1] \times [n_2]$ is a partition of $[n_1] \times [n_2]$ into rectangles. The cardinality $|\pi|$ of a
rectangular partition $\pi$ equals the number of rectangles in the partition. The
collection of all rectangular partitions of $[n_1] \times [n_2]$ is denoted by $\mathcal{P}$. We say
that $\theta \in \mathcal{M}$ is constant on $\pi = (A_1, \dots, A_k) \in \mathcal{P}$ if $\{\theta_{ij} : (i,j) \in A_l\}$ is a
singleton for each $l$.

Lemma 4.2. Fix $\theta \in \mathcal{M}$ and $\pi = (A_1, \dots, A_k) \in \mathcal{P}$ such that $\theta$ is constant
on $\pi$. Then

\[ T_{\mathcal{M}}(\theta) \subseteq \{ v \in \mathbb{R}^n : v(A_i) \in \mathcal{M}(A_i) \text{ for each } i = 1, \dots, k \}. \tag{33} \]

Proof. Suppose that $v = \alpha(t - \theta)$ for some $t \in \mathcal{M}$ and $\alpha \ge 0$. This means
that $v(A_i) = \alpha(t(A_i) - \theta(A_i))$ for each $i$. Because $t(A_i) \in \mathcal{M}(A_i)$ and $\theta(A_i)$
is a constant ($\theta$ is constant on $\pi$), we now have $v(A_i) \in \mathcal{M}(A_i)$. As the
right-hand side of (33) is a closed set, and $T_{\mathcal{M}}(\theta)$ is the closure of all such
$v$'s, the desired result follows. □

Remark 4.1. Note that we did not use the fact that $A_1, \dots, A_k$ are rectangular
in Lemma 4.2. We only used the fact that $\theta$ is constant on each $A_i$.
This means that (33) is true also when $A_1, \dots, A_k$ are the level sets of $\theta$, i.e.,
each $A_m = \{(i,j) : \theta_{ij} = a_m\}$ for some real number $a_m$. In fact, when $A_1, \dots, A_k$
are the level sets of $\theta$, we have equality in (33). This can be proved as follows.
Suppose that $v(A_i) \in \mathcal{M}(A_i)$ for each $i$. We shall argue then that $\theta + \alpha v \in \mathcal{M}$
for some $\alpha > 0$ which, of course, proves that $v \in T_{\mathcal{M}}(\theta)$. Observe first that
$A_1, \dots, A_k$ form a partition of $[n_1] \times [n_2]$. Let $\mathcal{D}$ denote the collection of
all pairs $((i,j), (k,l))$ such that $i \le k$ and $j \le l$ and $\theta_{ij} \ne \theta_{kl}$. Note, in
particular, that $(i,j)$ and $(k,l)$ belong to different elements of the partition
$A_1, \dots, A_k$ if $((i,j), (k,l)) \in \mathcal{D}$. Let

\[ \alpha := \min \left\{ \frac{\theta_{kl} - \theta_{ij}}{v_{ij} - v_{kl}} : ((i,j), (k,l)) \in \mathcal{D} \text{ and } v_{ij} > v_{kl} \right\}. \]

By monotonicity of $\theta$, it is clear that $\alpha > 0$. With this choice of $\alpha$, it is
elementary to check that $\theta + \alpha v \in \mathcal{M}$. This shows that (33) is true with
equality when $A_1, \dots, A_k$ are the level sets of $\theta$.
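The step size constructed in Remark 4.1 can be made concrete on a small example. The following sketch is illustrative only: the function names `is_isotonic` and `step_size` and the particular $2 \times 2$ matrices are assumptions chosen for the demonstration, and comparable pairs are taken to be those with $i \le k$ and $j \le l$. The direction `v` is isotonic on each level set of `theta` but is not itself an isotonic matrix.

```python
import numpy as np


def is_isotonic(a, tol=1e-12):
    """True if a is nondecreasing along both rows and columns (up to tol)."""
    rows_ok = np.all(np.diff(a, axis=0) >= -tol)
    cols_ok = np.all(np.diff(a, axis=1) >= -tol)
    return bool(rows_ok and cols_ok)


def step_size(theta, v):
    """min of (theta_kl - theta_ij) / (v_ij - v_kl) over comparable pairs
    (i, j) <= (k, l) with theta_ij != theta_kl and v_ij > v_kl (inf if none)."""
    n1, n2 = theta.shape
    alpha = np.inf
    for i in range(n1):
        for j in range(n2):
            for k in range(i, n1):
                for l in range(j, n2):
                    if theta[i, j] != theta[k, l] and v[i, j] > v[k, l]:
                        ratio = (theta[k, l] - theta[i, j]) / (v[i, j] - v[k, l])
                        alpha = min(alpha, ratio)
    return alpha


# Level sets of theta: {(1,1), (1,2), (2,1)} (value 0) and {(2,2)} (value 1).
theta = np.array([[0.0, 0.0], [0.0, 1.0]])
# v is isotonic on each level set, but decreases across the two level sets.
v = np.array([[0.0, 1.0], [2.0, -3.0]])

alpha = step_size(theta, v)   # binding pair is (2,1) -> (2,2): 1 / 5
moved = theta + alpha * v     # should again be an isotonic matrix
```

In this example the minimum is attained by the pair with the largest drop in `v`, and `moved` is isotonic, so `v` is a positive multiple of a difference of elements of the isotonic class, as the remark claims.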

We now have all the tools to complete the proof of Theorem 2.3. 

Proof of Theorem 2.3. The first step is to observe via inequality (31) that it
is enough to prove the existence of a universal positive constant $C$ for which

\[ \mathbb{E}\|\Pi(\epsilon, T_{\mathcal{M}}(\theta))\|^2 \le C k(\theta) \sigma^2 (\log n)^8 \quad \text{for all } \theta \in \mathcal{M}. \]

From the definition of $k(\theta)$, it is enough to prove that

\[ \mathbb{E}\|\Pi(\epsilon, T_{\mathcal{M}}(\theta))\|^2 \le C k \sigma^2 (\log n)^8 \tag{34} \]

for every $\pi = (A_1, \dots, A_k) \in \mathcal{P}$ such that $\theta$ is constant on $\pi$. To prove (34),
use the characterization of $T_{\mathcal{M}}(\theta)$ in Lemma 4.2 to observe that

\[ \mathbb{E}\|\Pi(\epsilon, T_{\mathcal{M}}(\theta))\|^2 \le \sum_{i=1}^{k} \mathbb{E}\|\Pi(\epsilon(A_i), \mathcal{M}(A_i))\|^2. \tag{35} \]

The task then reduces to that of bounding $\mathbb{E}\|\Pi(\epsilon(A_i), \mathcal{M}(A_i))\|^2$ for $i =
1, \dots, k$. It is crucial that each of $A_1, \dots, A_k$ is a rectangle. Fix $1 \le i \le k$ and
without loss of generality assume that $A_i = [n_1'] \times [n_2']$ for some $1 \le n_1' \le n_1$
and $1 \le n_2' \le n_2$. It is then easy to see that Theorem 2.1 for $\theta^* = 0$ and
$n_1 = n_1'$, $n_2 = n_2'$ immediately gives

\[ \mathbb{E}\|\Pi(\epsilon(A_i), \mathcal{M}(A_i))\|^2 \le C\sigma^2 (\log(2 n_1' n_2'))^8 \tag{36} \]

for a universal positive constant $C$ as long as $n_1' > 1$ and $n_2' > 1$. When
$n_1' = n_2' = 1$, it can be checked that the left hand side of (36) equals $\sigma^2$ which
means that (36) is still true provided $C$ is changed accordingly. Finally when
$\min(n_1', n_2') = 1$ and $\max(n_1', n_2') > 1$, one can use the result (2) from vector
isotonic estimation to prove (36). We thus have

\[ \mathbb{E}\|\Pi(\epsilon(A_i), \mathcal{M}(A_i))\|^2 \le C\sigma^2 (\log n)^8 \]

for a universal constant $C$ for all $n_1' \ge 1$ and $n_2' \ge 1$. This inequality together
with inequality (35) implies (34) which completes the proof of Theorem 2.3.

□ 


4.3. Proof of Theorem 2.4


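We now give the proof of Theorem 2.4. Let us first prove inequality (19).
For $\theta \in \mathcal{C}_{n_1}$, let $\Upsilon(\theta) \in \mathcal{M}$ be defined by $\Upsilon(\theta)_{ij} = \theta_i$ for all $1 \le i \le n_1$ and
$1 \le j \le n_2$. Also let $\Upsilon(\mathcal{C}_{n_1}) := \{\Upsilon(\theta) : \theta \in \mathcal{C}_{n_1}\}$. Note first that all level sets
of $\Upsilon(\theta)$ are rectangular for every $\theta \in \mathcal{C}_{n_1}$ which implies that $k(\Upsilon(\theta)) = k(\theta)$
for every $\theta \in \mathcal{C}_{n_1}$. Therefore, as a consequence of Theorem 2.3, we obtain
that for every matrix $\boldsymbol{\theta}^* \in \mathcal{M}$,

\[ R(\boldsymbol{\theta}^*, \hat\theta) \le \inf_{\theta \in \mathcal{C}_{n_1}} \left( \frac{\|\boldsymbol{\theta}^* - \Upsilon(\theta)\|^2}{n} + \frac{C k(\theta) \sigma^2}{n} (\log n)^8 \right). \]

Now if there exists a vector $\theta^* \in \mathcal{C}_{n_1}$ such that $\Upsilon(\theta^*) = \boldsymbol{\theta}^*$, then it is obvious that
$\|\boldsymbol{\theta}^* - \Upsilon(\theta)\|^2 = n_2 \|\theta^* - \theta\|^2$ which proves (19).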


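Inequality (20) can now be derived from (19) by a standard approximation
argument. For every $\theta^* \in \mathcal{C}_{n_1}$ with $D = D(\theta^*) = (\theta^*_{n_1} - \theta^*_1)^2$ and $0 < \delta \le \sqrt{D}$,
there exists $\theta \in \mathcal{C}_{n_1}$ with

\[ \frac{\|\theta - \theta^*\|^2}{n_1} \le \delta^2 \quad \text{and} \quad k(\theta) \le \frac{2\sqrt{D}}{\delta}. \]

This fact is easy to prove and a proof can be found, for example, in Chatterjee
et al. [7, Lemma B.1]. Using this, it follows directly from (19) that

\[ R(\boldsymbol{\theta}^*, \hat\theta) \le C \inf_{0 < \delta \le \sqrt{D}} \left( \delta^2 + \frac{2\sigma^2\sqrt{D}}{n\delta} (\log n)^8 \right). \]

The choice $\delta = (2\sigma^2\sqrt{D}/n)^{1/3}$ now leads to inequality (20). This choice of $\delta$
satisfies $\delta \le \sqrt{D}$ provided $nD \ge 2\sigma^2$. This completes the proof of Theorem
2.4.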
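The approximation step in the proof above can be illustrated with a simple quantization. This is a sketch under stated assumptions and not necessarily the construction of the cited lemma: the helper names `quantize` and `num_blocks` are hypothetical, and the approximant is obtained by rounding the monotone vector down to a grid of width $\delta$ anchored at its first entry, which gives at most $\lfloor \sqrt{D}/\delta \rfloor + 1 \le 2\sqrt{D}/\delta$ constant pieces when $\delta \le \sqrt{D}$.

```python
import numpy as np


def quantize(theta_star, delta):
    """Round a nondecreasing vector down to a grid of width delta anchored at
    its first entry; the result is nondecreasing and within delta entrywise."""
    base = theta_star[0]
    return base + delta * np.floor((theta_star - base) / delta)


def num_blocks(theta):
    """Number of constant pieces of a nondecreasing vector."""
    return 1 + int(np.count_nonzero(np.diff(theta)))


rng = np.random.default_rng(2)
n1 = 200
theta_star = np.sort(rng.normal(size=n1))    # a nondecreasing vector
D = (theta_star[-1] - theta_star[0]) ** 2    # squared range of theta_star
delta = 0.3 * np.sqrt(D)                     # any value in (0, sqrt(D)]
theta = quantize(theta_star, delta)

mse = float(np.mean((theta - theta_star) ** 2))  # ||theta - theta_star||^2 / n1
k = num_blocks(theta)

# The value of delta used at the end of the proof, for illustrative sigma and n.
sigma, n = 1.0, 10_000
delta_choice = (2 * sigma**2 * np.sqrt(D) / n) ** (1 / 3)
```

With these inputs `mse` is at most `delta ** 2` and `k` is at most `2 * sqrt(D) / delta`. The last two lines only evaluate the final choice of $\delta$; it does not exceed $\sqrt{D}$ because $nD \ge 2\sigma^2$ holds for these values.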

5. Discussion

In this paper we have studied the risk behavior of the LSE of an unknown
matrix $\theta^*$, constrained to be nondecreasing in both rows and columns, when
observed with errors. We prove both worst case and adaptive risk bounds
for the LSE. A highlight of the adaptation properties of the LSE is that it
adapts automatically to the intrinsic dimension of the problem.

Two further research questions are mentioned below. 

The logarithmic factors in our risk bounds, e.g., in Theorems 2.1 and 2.3,
are probably not optimal. They arise as a consequence of (i) the presence of
logarithmic factors in the covering number result in Gao and Wellner [12] (see
the proof of Lemma 3.4), and (ii) the fact that the entropy integral in (26)
in Theorem 3.3 diverges to $+\infty$ as $\delta \downarrow 0$. It is not clear to us at the moment
how to remove or reduce these logarithmic factors.

In this paper we deal with the estimation of an isotonic matrix. It is natural 
to ask how the results generalize to isotonic tensors of higher order, and 
more generally to estimating a multivariate isotonic regression function under 
general designs. It would be interesting to see whether such adaptation results 
hold in these situations. 

Appendix A: Appendix

A.1. Proof of Lemma 3.4

For each $\theta \in B_\infty(0,t)$, we associate a bivariate coordinate-wise nondecreasing
function $\phi_\theta : [0,1]^2 \to \mathbb{R}$ via

\[ \phi_\theta(x_1, x_2) := \min\{\theta_{ij} : n_1 x_1 \le i \le n_1,\ n_2 x_2 \le j \le n_2\} \]

for all $(x_1, x_2) \in [0,1]^2$. It can then be directly verified that

\[ \|\theta - \nu\|^2 = n \int_0^1 \int_0^1 \big(\phi_\theta(x_1,x_2) - \phi_\nu(x_1,x_2)\big)^2 \, dx_1 \, dx_2 \]

for every pair $\theta, \nu \in B_\infty(0,t)$. Moreover, if $\mathcal{C}([0,1]^2, t)$ denotes the class
of all bivariate coordinate-wise nondecreasing functions that are uniformly
bounded by $t$, then it is straightforward to verify that $\phi_\theta \in \mathcal{C}([0,1]^2, t)$ for
every $\theta \in B_\infty(0,t)$. These two latter facts immediately imply that

\[ N(\epsilon, B_\infty(0,t)) \le N\big(n^{-1/2}\epsilon/2, \mathcal{C}([0,1]^2, t), L_2\big) \tag{37} \]

where $N(\epsilon/2, \mathcal{C}([0,1]^2, t), L_2)$ denotes the $\epsilon/2$-covering number of $\mathcal{C}([0,1]^2, t)$
under the $L_2$ metric $L_2(f,g) := \left(\int (f-g)^2\right)^{1/2}$. This latter covering number
has been studied by Gao and Wellner [12] who proved that

\[ \log N(\epsilon/2, \mathcal{C}([0,1]^2, t), L_2) \le C \left(\frac{t}{\epsilon}\right)^2 \left[\log \frac{t}{\epsilon}\right]^2 \]

for a universal positive constant $C$. This and (37) together complete the proof
of Lemma 3.4. □


A.2. Proof of Lemma 3.5 

Fix $\theta \in B(0,1)$ and $1 \le i \le n_1$, $1 \le j \le n_2$. Our proof of (27) involves
considering the following two cases separately:

1. $\theta_{ij} < 0$: Here, by monotonicity of $\theta$, the inequality $\theta_{kl} \le \theta_{ij}$ must hold
for all $1 \le k \le i$ and $1 \le l \le j$. Therefore, $|\theta_{kl}| \ge |\theta_{ij}|$ holds for all
$(k,l) \in [1,i] \times [1,j]$. Finally because $\theta \in B(0,1)$, we have

\[ 1 \ge \sum_{k=1}^{i} \sum_{l=1}^{j} \theta_{kl}^2 \ge ij \, \theta_{ij}^2. \]

This proves (27) when $\theta_{ij} < 0$.

2. $\theta_{ij} \ge 0$: Here, by monotonicity of $\theta$, the condition $\theta_{kl} \ge \theta_{ij}$ must hold
for all $i \le k \le n_1$ and $j \le l \le n_2$. Therefore, by nonnegativity of $\theta_{ij}$
and by virtue of $\theta \in B(0,1)$ we have

\[ 1 \ge \sum_{k=i}^{n_1} \sum_{l=j}^{n_2} \theta_{kl}^2 \ge (n_1 + 1 - i)(n_2 + 1 - j) \, \theta_{ij}^2. \]

This proves (27) when $\theta_{ij} \ge 0$.

□
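The two cases of this proof give the entrywise bounds $|\theta_{ij}| \le (ij)^{-1/2}$ for negative entries and $|\theta_{ij}| \le ((n_1+1-i)(n_2+1-j))^{-1/2}$ for nonnegative entries. The sketch below checks them on a randomly generated isotonic matrix of unit Euclidean norm. The construction of the test matrix (double cumulative sums, centering, rescaling) is an assumption made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 8, 5

# Random isotonic matrix with entries of both signs, rescaled to unit norm.
raw = np.cumsum(np.cumsum(rng.random((n1, n2)), axis=0), axis=1)
theta = raw - raw.mean()          # a constant shift keeps the matrix isotonic
theta /= np.linalg.norm(theta)    # now theta lies in the unit ball

i = np.arange(1, n1 + 1)[:, None]   # 1-based row indices
j = np.arange(1, n2 + 1)[None, :]   # 1-based column indices
upper_left = 1 / np.sqrt(i * j)                          # case theta_ij < 0
lower_right = 1 / np.sqrt((n1 + 1 - i) * (n2 + 1 - j))   # case theta_ij >= 0
bound = np.maximum(upper_left, lower_right)

violations = int(np.sum(np.abs(theta) > bound + 1e-12))
negative = theta < 0
case1_ok = bool(np.all(np.abs(theta[negative]) <= upper_left[negative] + 1e-12))
case2_ok = bool(np.all(theta[~negative] <= lower_right[~negative] + 1e-12))
```

No entry violates the combined bound, and each sign case satisfies its own sharper inequality, mirroring the case split in the proof.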


A.3. Proof of Theorem 3.3 

The basic idea behind this proof is the following. By Lemma 3.5, it is clear
that for every matrix $\theta \in B(0,1)$, the entries $\theta_{ij}$ are bounded by constants
provided $\min(i-1, n_1-i)$ and $\min(j-1, n_2-j)$ are not too small. Further,
for bounded isotonic matrices, the metric entropy bounds can be obtained
from Lemma 3.4. We shall therefore employ a peeling-type argument where
we partition the entries of $\theta$ into various subrectangles and use Lemma 3.4
in each subrectangle.

Let us introduce some notation. Let $B$ denote the set $B(0,1)$ for simplicity.
For a subset $S \subseteq \{(i,j) : 1 \le i \le n_1, 1 \le j \le n_2\}$ with cardinality $|S|$ and
$\theta \in \mathcal{M}$, let $\theta(S) \in \mathbb{R}^{|S|}$ be defined as

\[ \theta(S) := (\theta_{ij} : (i,j) \in S). \]

Further let $B_S$ denote the collection of all $\theta(S)$ as $\theta$ ranges over $B$. The
$\epsilon$-metric entropy of $B_S$ (under the Euclidean metric on $\mathbb{R}^{|S|}$) will be denoted
by $N(\epsilon, B_S)$.

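We first prove inequality (25). Let $I_1 := \{i : 1 \le i \le n_1/2\}$ and $I_2 := \{i :
n_1/2 < i \le n_1\}$. Also $J_1 := \{j : 1 \le j \le n_2/2\}$ and $J_2 := \{j : n_2/2 < j \le
n_2\}$. Because

\[ \|\theta - \alpha\|^2 = \sum_{1 \le k, l \le 2} \|\theta(I_k \times J_l) - \alpha(I_k \times J_l)\|^2 \]

for all $\theta$ and $\alpha$, it follows that

\[ \log N(\epsilon, B) \le \sum_{k=1}^{2} \sum_{l=1}^{2} \log N(\epsilon/2, B_{I_k \times J_l}). \]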
We shall prove below that for every $1 \le k, l \le 2$ and $\epsilon > 0$,

\[ \log N(\epsilon/2, B_{I_k \times J_l}) \le C \, \frac{(\log n_1)^2 (\log n_2)^2}{\epsilon^2} \left[ \log \frac{4\sqrt{\log n_1 \log n_2}}{\epsilon} \right]^2 \tag{38} \]

for a universal positive constant $C$. This would then complete the proof of
(25).

Let $k_1$ and $k_2$ denote the smallest integers for which $2^{k_1} > n_1/2$ and $2^{k_2} >
n_2/2$. For every $0 \le u < k_1$ and $0 \le v < k_2$, let

\[ N^1_u := \{i \in I_1 : 2^u \le i \le \min(2^{u+1} - 1, n_1/2)\} \quad \text{and} \quad \tilde N^1_v := \{j \in J_1 : 2^v \le j \le \min(2^{v+1} - 1, n_2/2)\}. \]

Similarly let

\[ N^2_u := \{i \in I_2 : 2^u \le n_1 + 1 - i \le \min(2^{u+1} - 1, n_1/2)\} \quad \text{and} \quad \tilde N^2_v := \{j \in J_2 : 2^v \le n_2 + 1 - j \le \min(2^{v+1} - 1, n_2/2)\}. \]
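The dyadic blocks used for the rows in the first half $I_1$ can be enumerated directly. The sketch below is an illustration under stated assumptions (the function name `dyadic_blocks` is hypothetical and only the blocks inside $I_1$ are built). It shows that the blocks partition $I_1$, that block $u$ has at most $2^u$ elements, and that the number of blocks is at most $\log_2 n_1$.

```python
import math


def dyadic_blocks(n1):
    """Blocks {i in I_1 : 2^u <= i <= min(2^(u+1) - 1, n1/2)} for 0 <= u < k1,
    where I_1 = {1, ..., floor(n1/2)} and k1 is the smallest integer
    with 2^k1 > n1/2."""
    half = n1 / 2
    k1 = 0
    while 2 ** k1 <= half:
        k1 += 1
    blocks = []
    for u in range(k1):
        upper = min(2 ** (u + 1) - 1, math.floor(half))
        blocks.append(list(range(2 ** u, upper + 1)))
    return k1, blocks


k1, blocks = dyadic_blocks(10)
# For n1 = 10: k1 == 3 and blocks == [[1], [2, 3], [4, 5]]
```

Because the block sizes double while the entrywise bound of Lemma 3.5 halves in squared magnitude, each block has bounded total "energy", which is what the peeling argument exploits.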

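For each pair $1 \le k, l \le 2$, because

\[ \|\theta(I_k \times J_l) - \alpha(I_k \times J_l)\|^2 = \sum_{u=0}^{k_1 - 1} \sum_{v=0}^{k_2 - 1} \|\theta(N^k_u \times \tilde N^l_v) - \alpha(N^k_u \times \tilde N^l_v)\|^2, \]

it follows that

\[ \log N(\epsilon/2, B_{I_k \times J_l}) \le \sum_{u=0}^{k_1 - 1} \sum_{v=0}^{k_2 - 1} \log N\big(k_1^{-1/2} k_2^{-1/2} \epsilon/2, B_{N^k_u \times \tilde N^l_v}\big). \tag{39} \]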

Now fix $0 \le u < k_1$, $0 \le v < k_2$ and $1 \le k, l \le 2$. We argue below that
$N(k_1^{-1/2} k_2^{-1/2} \epsilon/2, B_{N^k_u \times \tilde N^l_v})$ can be controlled using Lemmas 3.4 and 3.5. Note
first that the cardinality of $N^k_u \times \tilde N^l_v$ is at most $|N^k_u| |\tilde N^l_v| \le 2^{u+v}$. We also
claim that

\[ \max_{i \in N^k_u,\, j \in \tilde N^l_v} |\theta_{ij}| \le 2^{-(u+v)/2} \quad \text{for all } \theta \in B. \tag{40} \]

We will prove the above claim a little later. Assuming for now that it is true,
we can use Lemma 3.4 for $B_{N^k_u \times \tilde N^l_v}$ to deduce that

\[ \log N\big(k_1^{-1/2} k_2^{-1/2} \epsilon/2, B_{N^k_u \times \tilde N^l_v}\big) \le C \, \frac{k_1 k_2}{\epsilon^2} \left( \log \frac{4 k_1^{1/2} k_2^{1/2}}{\epsilon} \right)^2 \]

for a universal positive constant $C$. Inequality (39) then gives

\[ \log N(\epsilon/2, B_{I_k \times J_l}) \le C \, \frac{k_1^2 k_2^2}{\epsilon^2} \left( \log \frac{4 k_1^{1/2} k_2^{1/2}}{\epsilon} \right)^2. \tag{41} \]

Because $k_1$ is the smallest integer for which $2^{k_1} > n_1/2$, we have $2^{k_1 - 1} \le n_1/2$
which means that $k_1 \le \log n_1$. Similarly $k_2 \le \log n_2$. This together with (41)
implies (38) which completes the proof of (25). The only thing that remains
now is to prove (40).

We first prove (40) for $k = l = 1$. By Lemma 3.5, we get that $|\theta_{ij}| \le (ij)^{-1/2}$
for all $\theta \in B$ and $(i,j) \in I_1 \times J_1$. Clearly $\min_{i \in N^1_u} i = 2^u$ and $\min_{j \in \tilde N^1_v} j =
2^v$. This proves (40) for $k = l = 1$. A similar argument will also work for
$k = l = 2$. For the case when $k = 1$, $l = 2$, note that

\[ \max_{N^1_u \times \tilde N^2_v} \theta_{ij} \le \max_{N^2_u \times \tilde N^2_v} \theta_{ij} \le 2^{-(u+v)/2} \]

which follows from the monotonicity of $\theta$ and (40) for $k = l = 2$. Similarly,

\[ \min_{N^1_u \times \tilde N^2_v} \theta_{ij} \ge \min_{N^1_u \times \tilde N^1_v} \theta_{ij} \ge -2^{-(u+v)/2}. \]

Putting these together, we obtain (40) for $k = 1$, $l = 2$. A similar argument
will work for $k = 2$, $l = 1$. This completes the proof of (25).

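For (26), simply observe that by (25),

\[ \int_\delta^1 \sqrt{\log N(\epsilon, B(0,1))} \, d\epsilon \le \sqrt{C}\sqrt{A} \int_\delta^1 \frac{1}{\epsilon} \log \frac{B}{\epsilon} \, d\epsilon = \frac{\sqrt{C}\sqrt{A}}{2} \left[ \left(\log \frac{B}{\delta}\right)^2 - (\log B)^2 \right] \le \frac{\sqrt{C}\sqrt{A}}{2} \left( \log \frac{B}{\delta} \right)^2. \]

This completes the proof of Theorem 3.3. □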
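The closed form of the integral $\int_\delta^1 \epsilon^{-1} \log(B/\epsilon) \, d\epsilon = \tfrac12[(\log(B/\delta))^2 - (\log B)^2]$, which drives the entropy bound (26), can be confirmed numerically. The sketch below uses a plain trapezoidal rule on a geometric grid. The values of $\delta$ and $B$ are arbitrary illustrative choices and the function name `entropy_integral` is hypothetical.

```python
import math
import numpy as np


def entropy_integral(delta, B, num=200_001):
    """Trapezoidal approximation of the integral of log(B/eps)/eps over [delta, 1]."""
    eps = np.geomspace(delta, 1.0, num)   # grid is finer near delta
    f = np.log(B / eps) / eps
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(eps)))


delta, B = 1e-3, 4.0
numeric = entropy_integral(delta, B)
closed_form = 0.5 * (math.log(B / delta) ** 2 - math.log(B) ** 2)
upper_bound = 0.5 * math.log(B / delta) ** 2
```

The numerical value agrees with the closed form, and the closed form grows like $(\log(1/\delta))^2$ as $\delta$ decreases, which is one source of the logarithmic factors in the risk bounds.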


A.4. Proof of Theorem 2.2


We shall use Assouad's lemma to prove Theorem 2.2. The following version
of Assouad's Lemma is a consequence of Lemma 24.3 of Van der Vaart [23,
p. 347].

Lemma A.1 (Assouad). Fix $D > 0$ and a positive integer $d$. Suppose that,
for each $\tau \in \{-1,1\}^d$, there is an associated $g^\tau$ in $\mathcal{M}$ with $D(g^\tau) \le D$. Then

\[ \inf_{\hat\theta} \sup_{\theta \in \mathcal{M} : D(\theta) \le D} R(\theta, \hat\theta) \ge \frac{d}{8} \min_{\tau \ne \tau'} \frac{\ell^2(g^\tau, g^{\tau'})}{\Upsilon(\tau, \tau')} \min_{\Upsilon(\tau, \tau') = 1} \left( 1 - \|P_{g^\tau} - P_{g^{\tau'}}\|_{TV} \right), \]

where $\Upsilon(\tau, \tau') := \sum_{i=1}^{d} I\{\tau_i \ne \tau'_i\}$ denotes the Hamming distance between $\tau$
and $\tau'$ and $\|\cdot\|_{TV}$ denotes the total variation distance. The notation $P_g$ for
$g \in \mathcal{M}$ refers to the joint distribution of $y_{ij} = g_{ij} + \epsilon_{ij}$, for $i \in [n_1]$, $j \in [n_2]$,
when $(\epsilon_{ij})$ are independent normally distributed random variables with mean
zero and variance $\sigma^2$.


We are now ready to prove Theorem 2.2. 

Fix $D > 0$ and an integer $k$ with $1 \le k \le \min(n_1, n_2)$. Let $m_1$ and $m_2$ be
defined so that $k = \lfloor n_1/m_1 \rfloor = \lfloor n_2/m_2 \rfloor$. Let $d = k^2$. We denote elements
of $\{-1,1\}^d$ by $(\tau_{uv} : (u,v) \in [k] \times [k])$. For each such $\tau \in \{-1,1\}^d$, we define
$g^\tau \in \mathcal{M}$ in the following way. For $i \in [n_1]$ and $j \in [n_2]$, if there exist $u, v \in [k]$
for which $(u-1)m_1 < i \le u m_1$ and $(v-1)m_2 < j \le v m_2$, we take


9a = 


Vd( 


u + v 


V 2k 



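Otherwise we take $g^\tau_{ij} = \sqrt{D}$. One can check that $g^\tau \in \mathcal{M}$ and $D(g^\tau) \le D$
for every $\tau \in \{-1,1\}^d$.

We shall now use Lemma A.1 with $d = k^2$ and this collection $\{g^\tau : \tau \in
\{-1,1\}^d\}$. Note first that

\[ \ell^2(g^\tau, g^{\tau'}) = \frac{1}{n} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \big(g^\tau_{ij} - g^{\tau'}_{ij}\big)^2 = \frac{1}{n} \sum_{u,v \in [k]} \ \sum_{i : (u-1)m_1 < i \le u m_1} \ \sum_{j : (v-1)m_2 < j \le v m_2} \big(g^\tau_{ij} - g^{\tau'}_{ij}\big)^2 \]
\[ = \frac{1}{n} \sum_{u,v \in [k]} \frac{m_1 m_2 D (\tau_{uv} - \tau'_{uv})^2}{36 k^2} = \frac{m_1 m_2 D}{9 n k^2} \Upsilon(\tau, \tau'). \]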


Therefore, this implies that

\[ \min_{\tau \ne \tau'} \frac{\ell^2(g^\tau, g^{\tau'})}{\Upsilon(\tau, \tau')} = \frac{m_1 m_2 D}{9 n k^2}. \]

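To bound $\|P_{g^\tau} - P_{g^{\tau'}}\|_{TV}$, we use Pinsker's inequality because the Kullback-Leibler
divergence $D(P_{g^\tau} \| P_{g^{\tau'}})$ has a simple expression in terms of $\ell^2(g^\tau, g^{\tau'})$:

\[ \|P_{g^\tau} - P_{g^{\tau'}}\|^2_{TV} \le \frac{1}{2} D(P_{g^\tau} \| P_{g^{\tau'}}) = \frac{n}{4\sigma^2} \ell^2(g^\tau, g^{\tau'}) = \frac{m_1 m_2 D}{36 \sigma^2 k^2} \Upsilon(\tau, \tau'). \]

This gives

\[ \min_{\Upsilon(\tau, \tau') = 1} \left( 1 - \|P_{g^\tau} - P_{g^{\tau'}}\|_{TV} \right) \ge 1 - \frac{\sqrt{m_1 m_2 D}}{6 k \sigma}. \]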


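Lemma A.1 then gives the lower bound for $\Delta := \inf_{\hat\theta} \sup_{\theta \in \mathcal{M} : D(\theta) \le D} R(\theta, \hat\theta)$
as given below:

\[ \Delta \ge \frac{m_1 m_2 D}{72 n} \left( 1 - \frac{\sqrt{m_1 m_2 D}}{6 k \sigma} \right). \]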

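Because $k = \lfloor n_i/m_i \rfloor$ for $i = 1, 2$, it follows that $n_i/(k+1) < m_i \le n_i/k$ for
$i = 1, 2$. This gives

\[ \Delta \ge \frac{D}{72(k+1)^2} \left( 1 - \frac{\sqrt{nD}}{6\sigma k^2} \right) \ge \frac{D}{288 k^2} \left( 1 - \frac{\sqrt{nD}}{6\sigma k^2} \right) \]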


where we have also used that $k + 1 \le 2k$. The choice $k = (nD/(9\sigma^2))^{1/4}$ now
leads to $\Delta \ge \sigma\sqrt{D}/(192\sqrt{n})$. This gives what we wanted to prove provided
our choice of $k$ satisfies $1 \le k \le \min(n_1, n_2)$. For this, it suffices to simply
note that $n \ge 9\sigma^2/D$ implies that $k \ge 1$ and (17) implies $k \le \min(n_1, n_2)$.
This completes the proof of Theorem 2.2. □
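The arithmetic behind the final choice of $k$ can be checked directly. The sketch below evaluates the lower bound $\frac{D}{288k^2}\big(1 - \frac{\sqrt{nD}}{6\sigma k^2}\big)$ at $k = (nD/(9\sigma^2))^{1/4}$ and compares it with $\sigma\sqrt{D}/(192\sqrt{n})$. The numerical values of $n$, $D$ and $\sigma$ are arbitrary illustrative choices, the function name `assouad_lower_bound` is hypothetical, and $k$ is treated as a real number, ignoring the integrality requirement of the construction.

```python
import math


def assouad_lower_bound(k, n, D, sigma):
    """Evaluate D / (288 k^2) * (1 - sqrt(n D) / (6 sigma k^2))."""
    return D / (288 * k ** 2) * (1 - math.sqrt(n * D) / (6 * sigma * k ** 2))


n, D, sigma = 10_000, 4.0, 1.5
k = (n * D / (9 * sigma ** 2)) ** 0.25        # real-valued version of the choice of k
correction = 1 - math.sqrt(n * D) / (6 * sigma * k ** 2)   # equals 1/2 at this k
value = assouad_lower_bound(k, n, D, sigma)
target = sigma * math.sqrt(D) / (192 * math.sqrt(n))
```

At this $k$ the correction factor is exactly one half, so the bound reduces to $D/(576 k^2)$ with $k^2 = \sqrt{nD}/(3\sigma)$, which is the stated $n^{-1/2}$ rate.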